# 🎓 Databricks Workshop: Healthcare Payer Analytics
**Hands-on Introduction to Databricks ETL and Analytics**

---

## 📚 Workshop Objectives

By the end of this workshop, you will be able to:

1. ✅ Understand the **Medallion Architecture** (Bronze, Silver, Gold layers)
2. ✅ Build data pipelines using **SQL** and **PySpark**
3. ✅ Load raw data using `COPY INTO` commands
4. ✅ Transform and clean data with best practices
5. ✅ Create analytics-ready tables for business insights
6. ✅ Visualize data using Databricks native capabilities

---

### 🏥 Dataset Overview

We'll work with **healthcare payer data** including:
- **Members**: Health plan enrollees
- **Claims**: Medical claim submissions
- **Providers**: Healthcare providers (doctors, clinics)
- **Diagnoses**: Diagnosis codes from claims
- **Procedures**: Medical procedures performed

---



## 📑 Table of Contents

1. **[Introduction](#introduction)**
   - What is a Lakehouse?
   - Unity Catalog Overview
   - Medallion Architecture
   - Managed Tables & Predictive Optimization

2. **[Modeling Concepts](#modeling-concepts)**
   - Dimensional Modeling
   - Data Vault Architecture
   - Sample Data Model

3. **[Setup](#setup)**
   - Configure Widgets & Parameters
   - Create Catalog & Schemas
   - Setup Unity Catalog Volumes
   - Download Sample Data

4. **[Bronze Layer - Ingest Raw Data](#bronze-layer)**
   - Understanding Bronze Layer
   - Using COPY INTO
   - Loading with PySpark
   - Data Exploration Exercises

5. **[Silver Layer - Transform & Clean](#silver-layer)**
   - Understanding Silver Layer
   - SQL Transformations
   - PySpark Transformations
   - Data Quality Checks

6. **[Gold Layer - Analytics Ready](#gold-layer)**
   - Understanding Gold Layer
   - Enriched Fact Tables
   - Provider Performance Dashboard
   - Time-Series Analysis
   - Cohort Analysis

7. **[Analytics & Visualization](#analytics-visualization)**
   - Using Databricks AI Assistant
   - Statistical Analysis
   - Creating Visualizations (6+ types)
   - Advanced Analytics Examples

8. **[Hands-On Exercises](#exercises)**
   - Exercise 1: Provider Specialty Analysis
   - Exercise 2: Time-Based Claims Analysis
   - Exercise 3: Cost Outlier Detection
   - Exercise 4: Member Health Risk Scoring

9. **[Best Practices & Performance](#best-practices)**
   - Performance Optimization
   - Data Quality Best Practices
   - Delta Lake Maintenance
   - Architecture Guidelines

10. **[Quick Reference Guide](#reference-guide)**
    - PySpark Operations
    - SQL Commands
    - Databricks Utilities

11. **[Workshop Summary & Next Steps](#summary)**
    - What You Accomplished
    - Continue Learning
    - Resources & Certifications

---

> **💡 Workshop Tip**: This is a hands-on workshop! Execute each cell as you go and experiment with the code. Don't hesitate to ask questions or use the Databricks AI Assistant.

---


# Introduction
This guide shows how to design a **Databricks Medallion Architecture** (Bronze, Silver, Gold) pipeline using sample tables and realistic transformations relevant to a **Healthcare Payer**. The code is written for Databricks notebooks and mixes SQL cells with PySpark cells.


# What is a lakehouse?

1. **Hybrid Architecture:**  
   A lakehouse combines the best of data lakes (flexible, cheap storage) and data warehouses (structured, fast analytics), providing transactional and governance features on top of open cloud storage.

2. **ACID Transactions and Schema Governance:**  
   Lakehouses support ACID transactions for consistent concurrent data access and enforce schema management, which is essential for data integrity and compliance.

3. **Open and Decoupled:**  
   They use open file formats (like Parquet), decouple compute from storage for flexible scalability, and allow access by a variety of analytics, BI, and machine learning tools.

4. **Supports All Workloads and Data Types:**  
   The architecture enables SQL analytics, data science, machine learning, and can handle structured, semi-structured, and unstructured data (including images, text, video).

5. **Single Platform, Enterprise Ready:**  
   With features like real-time streaming, end-to-end governance, access control, and data discovery tools, lakehouses reduce complexity—allowing enterprises to manage all data and analytics needs in one unified system.
![](https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new.png)

## Unity Catalog

[Unified and open governance for data and AI in the Lakehouse](https://www.databricks.com/product/unity-catalog#features)

Eliminate silos, simplify governance and accelerate insights at scale:

- Centralizes governance, access control, auditing, and data discovery for all data and AI assets across Databricks workspaces.
- Enables fine-grained, consistent data access policies (row- and column-level), defined once and applied everywhere.
- Provides comprehensive data lineage and audit logs, showing how and by whom data is accessed and transformed.
- Supports data discovery, tagging, and documentation, making it easier to find and understand datasets and models.
- Works across multiple clouds and supports open formats (Delta, Parquet, etc.), avoiding vendor lock-in and enabling broad interoperability.
- Allows secure data and AI sharing within and outside the organization, including clean rooms and partner collaborations.
- Provides built-in monitoring for data quality, freshness, and usage, helping ensure compliance and rapid troubleshooting.
- Integrates tightly with the catalog/schema/object model, enhancing organization and security for all managed data assets.

![](https://www.databricks.com/sites/default/files/2025-05/header-unity-catalog.png?v=1748513086)

[Unity Catalog Search & Data Explorer](https://app.getreprise.com/launch/96mpAqy/)

[Exploring Lineage and Governance with Unity Catalog](https://app.getreprise.com/launch/MnqjQDX/)

[A Comprehensive Guide to Data and AI Governance](https://www.databricks.com/sites/default/files/2024-08/comprehensive-guide-to-data-and-ai-governance.pdf)






## Medallion lakehouse architecture

In this example, we will be following the **medallion lakehouse architecture**. The medallion architecture is a data design pattern to organize data in a lakehouse. The goal is to progressively improve the quality and structure of the data as it flows through each layer (Bronze [**raw**] → Silver [**staging**] → Gold [**main**]).

1. **Bronze layer**: the raw, unvalidated data
2. **Silver**: cleansed and conformed data
3. **Gold**: curated business-level tables

<img src="https://www.databricks.com/sites/default/files/inline-images/building-data-pipelines-with-delta-lake-120823.png?v=1702318922" alt="Medallion architecture" width="600" height="500">
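
To make the layer progression concrete, here is a minimal plain-Python sketch (no Spark required) that mimics the three layers on a few in-memory rows. The sample rows and the function names `to_silver` and `to_gold` are illustrative assumptions, not part of the workshop dataset or of any Databricks API.

```python
from collections import defaultdict

# Hypothetical raw rows, as they might arrive from a CSV file (every value is a string).
bronze = [
    {"claim_id": "C1", "member_id": "M1", "total_charge": "120.50"},
    {"claim_id": "C1", "member_id": "M1", "total_charge": "120.50"},  # exact duplicate
    {"claim_id": "C2", "member_id": "M1", "total_charge": "80"},
    {"claim_id": None, "member_id": "M2", "total_charge": "15"},      # missing key
]

def to_silver(rows):
    """Drop rows without a key, cast types, and keep one row per claim_id."""
    seen, clean = set(), []
    for row in rows:
        if row["claim_id"] is None or row["claim_id"] in seen:
            continue
        seen.add(row["claim_id"])
        clean.append({**row, "total_charge": float(row["total_charge"])})
    return clean

def to_gold(rows):
    """Aggregate cleaned claims into one summary row per member."""
    totals = defaultdict(lambda: {"total_claims": 0, "sum_claims": 0.0})
    for row in rows:
        summary = totals[row["member_id"]]
        summary["total_claims"] += 1
        summary["sum_claims"] += row["total_charge"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each function only reads the output of the previous layer, which is the property that lets any layer be rebuilt from the one below it.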

## Managed tables

[How Unity Catalog Managed Tables Automate Performance at Scale](https://www.databricks.com/blog/how-unity-catalog-managed-tables-automate-performance-scale) with [Predictive Optimization](https://learn.microsoft.com/en-us/azure/databricks/optimizations/predictive-optimization)



<img src="https://www.databricks.com/sites/default/files/inline-images/image2_48.png?v=1751297384" alt="Managed Tables" width="600" height="500">


[Faster Queries: 20X query latency reduction](https://www.databricks.com/blog/predictive-optimization-automatically-delivers-faster-queries-and-lower-tco)

**Predictive Optimization** in Databricks automates table management by leveraging Unity Catalog and the Data Intelligence Platform. This innovative feature currently runs the following optimizations for Unity Catalog managed tables:

* **OPTIMIZE** - Triggers incremental clustering for enabled tables. Improves query performance by optimizing file sizes.
* **VACUUM** - Reduces storage costs by deleting data files no longer referenced by the table.
* **ANALYZE** - Triggers incremental update of statistics to improve query performance. 


<img src="https://www.databricks.com/sites/default/files/styles/max_1000x1000/public/2024-05/db-976-blog-img-og.png?itok=qWBT8VA-&v=1717158571" alt="Managed Tables" width="600" height="500">

**Compaction** - This enhances query performance by optimizing file sizes, ensuring that data retrieval is efficient.

**Liquid Clustering** - This technique incrementally clusters incoming data, enabling optimal data layout and efficient data skipping.



# Databricks Medallion Pipeline for a Healthcare Payer


## Modeling Concepts

Databricks fully supports both **dimensional modeling** (Kimball/star schema) and **Inmon-style, Data Vault architectures (hubs, satellites, links)** on the Lakehouse platform. For dimensional models, you can build classic star and snowflake schemas directly with SQL, benefiting from ACID transactions and scalable Delta Lake tables.

For Inmon/Data Vault use cases, Databricks provides rich support for hub-and-satellite models that address core enterprise needs for history, auditability, and extensibility.

The Lakehouse approach lets you mix these styles as needed within a single platform, so you can incrementally land data in Raw Vault/EDW structures and later expose it as dimensional marts—all with Delta Live Tables, fine-grained security, and open formats.

Key blog resources:

[Implementing Dimensional Modeling](https://www.databricks.com/blog/implementing-dimensional-data-warehouse-databricks-sql-part-1)

[Implementing Data Vault/Hub-Satellite](https://www.databricks.com/blog/2022/06/24/prescriptive-guidance-for-implementing-a-data-vault-model-on-the-databricks-lakehouse-platform.html) 

[Data Vault Best Practices](https://www.databricks.com/blog/data-vault-best-practice-implementation-lakehouse)

<div style="display: flex; justify-content: space-between;">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/5c87faea-3e60-4f71-826d-42d04f6cdc0b.png" alt="Managed Tables" width="400" height="350">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/6826c275-d462-4c07-a978-43fe9c40f3ed.png" alt="Managed Tables" width="400" height="350">
</div>






## Sample Data Model

For a payer, commonly used tables include:

- **Members**: members enrolled in a health plan
- **Claims**: medical claim submissions
- **Providers**: healthcare providers (doctors, clinics)
- **Diagnoses**: claim diagnosis codes
- **Procedures**: procedures/services performed

Each table should have at least 50 rows.

<img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/bdd54dc0-f3c7-4975-80a3-0017ebdb121c.png" alt="Managed Tables" width="400" height="300">





## Key Columns by Table

| Table | Key Columns |
|-------|-------------|
| **Members** | member_id, first_name, last_name, birth_date, gender, plan_id, effective_date |
| **Claims** | claim_id, member_id, provider_id, claim_date, total_charge, claim_status |
| **Providers** | provider_id, npi, provider_name, specialty, address, city, state |
| **Diagnoses** | claim_id, diagnosis_code, diagnosis_desc |
| **Procedures** | claim_id, procedure_code, procedure_desc, amount |

# SETUP

In [0]:
dbutils.widgets.text("catalog", "my_catalog", "Catalog")
dbutils.widgets.text("bronze_db", "payer_bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "payer_silver", "Silver DB")
dbutils.widgets.text("gold_db", "payer_gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

path = f"/Volumes/{catalog}/{bronze_db}/payer/files/"
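
Unity Catalog volume paths follow the pattern `/Volumes/<catalog>/<schema>/<volume>/<path>`. Because the same prefix is rebuilt by hand in several later cells, a small helper can keep the pieces consistent. The function name `volume_path` is a hypothetical helper for illustration, not a Databricks utility.

```python
def volume_path(catalog, schema, volume, *parts):
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/<parts...>."""
    segments = [catalog, schema, volume, *parts]
    return "/Volumes/" + "/".join(s.strip("/") for s in segments)

# With the default widget values used in this workshop:
print(volume_path("my_catalog", "payer_bronze", "payer", "files", "claims"))
```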

In [0]:
print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")
print(f"Path: {path}")

In [0]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

In [0]:
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

Create a new **volume** with the folder layout below, then load the shared files into it. The following cells do both.

(schema) payer_bronze \
|--- payer/files/ \
|------ claims \
|------ members \
|------ providers \
|------ diagnosis \
|------ procedures


In [0]:
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{bronze_db}.payer")

In [0]:
# Create the folders inside the volume
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/claims")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/members")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/providers")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/downloads")

In [0]:
import requests
import zipfile
import io
import os
import shutil

# Define the URL of the ZIP file
url = "https://github.com/bigdatavik/databricksfirststeps/blob/6b225621c3c010a2734ab604efd79c15ec6c71b8/data/Payor_Archive.zip?raw=true"

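# Download the ZIP file (fail fast on HTTP errors instead of trying to unzip an error page)
response = requests.get(url, timeout=60)
response.raise_for_status()
zip_file = zipfile.ZipFile(io.BytesIO(response.content))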

# Define the base path
base_path = f"/Volumes/{catalog}/{bronze_db}/payer/downloads" 

# Extract the ZIP file to the base path
zip_file.extractall(base_path)

# Define the paths
paths = {
    "claims.csv": f"{base_path}/claims",
    "diagnoses.csv": f"{base_path}/diagnosis",
    "procedures.csv": f"{base_path}/procedures",
    "member.csv": f"{base_path}/members",
    "providers.csv": f"{base_path}/providers"
}

# Create the destination directories if they do not exist
for dest_path in paths.values():
    os.makedirs(dest_path, exist_ok=True)

# Move the files to the respective directories
for file_name, dest_path in paths.items():
    source_file = f"{base_path}/{file_name}"
    if os.path.exists(source_file):
        os.rename(source_file, f"{dest_path}/{file_name}")
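
The cell above skips any expected file that is missing from the archive without saying so, which only shows up later when the copy step fails. The sketch below is one possible variant that reports missing files instead. It uses a tiny in-memory archive and a temporary directory so it can run anywhere; the function name `extract_and_route` and the sample contents are assumptions made for illustration.

```python
import io
import tempfile
import zipfile
from pathlib import Path

def extract_and_route(zip_bytes, base_dir, routes):
    """Extract an archive into base_dir, then move each listed file into its subfolder.

    routes maps a file name inside the archive to a destination folder name.
    Returns the names of expected files that were not found in the archive.
    """
    base = Path(base_dir)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(base)
    missing = []
    for file_name, folder in routes.items():
        source = base / file_name
        if not source.exists():
            missing.append(file_name)
            continue
        target_dir = base / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        source.rename(target_dir / file_name)
    return missing

# Build a tiny in-memory archive so the sketch does not need network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("claims.csv", "claim_id,total_charge\nC1,100\n")

with tempfile.TemporaryDirectory() as tmp:
    missing = extract_and_route(
        buffer.getvalue(), tmp, {"claims.csv": "claims", "member.csv": "members"}
    )
    print("moved:", (Path(tmp) / "claims" / "claims.csv").exists())
    print("missing:", missing)
```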



In [0]:
%python
# Copy the files to the specified directories and print the paths
shutil.copy(f"{base_path}/claims/claims.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/claims/claims.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/claims/claims.csv")

shutil.copy(f"{base_path}/diagnosis/diagnoses.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis/diagnosis.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/diagnosis/diagnosis.csv")

shutil.copy(f"{base_path}/procedures/procedures.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures/procedures.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/procedures/procedures.csv")

shutil.copy(f"{base_path}/members/member.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/members/members.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/members/members.csv")

shutil.copy(f"{base_path}/providers/providers.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/providers/providers.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/providers/providers.csv")

# 🚀 Let's Build Your First Data Pipeline!

---

## Workshop Roadmap

```
📥 Bronze Layer    →    🔧 Silver Layer    →    ⭐ Gold Layer    →    📊 Analytics
   (Raw Data)          (Cleaned Data)        (Business Tables)      (Insights)
```

In the following sections, we'll build a complete data pipeline following the **Medallion Architecture**:

1. **Bronze Layer**: Ingest raw CSV files into Delta tables
2. **Silver Layer**: Clean, deduplicate, and transform data
3. **Gold Layer**: Create enriched analytics tables
4. **Analytics**: Generate insights and visualizations

Let's get started! 🎉

In [0]:
# %sql
# -- Set the catalog and schema
# CREATE CATALOG IF NOT EXISTS my_catalog;
# USE CATALOG my_catalog;

# -- Create bronze schema
# CREATE SCHEMA IF NOT EXISTS payer_bronze;

# 📥 Bronze Layer – Ingest Raw Data

---

## What is the Bronze Layer?

The **Bronze Layer** is the landing zone for raw data. Here we:
- 📂 Load data "as-is" from source files (CSV, JSON, Parquet, etc.)
- 💾 Store in Delta Lake format for ACID transactions
- 📝 Apply minimal transformation (just schema inference)
- ⏱️ Keep historical data for audit and reprocessing

> **💡 Best Practice**: Use `COPY INTO` for incremental, idempotent loading. It automatically skips already-loaded files!

---



## Step 1: Verify Source Files

Let's first check that our source files are available:

In [0]:
%sql
LIST '/Volumes/my_catalog/payer_bronze/payer/files/claims/'

## Step 2: Load Data with COPY INTO

### 📖 Understanding COPY INTO

`COPY INTO` is Databricks' recommended command for loading data from cloud storage into Delta tables.

**Key Benefits:**
- ✅ **Idempotent**: Safely re-run without duplicating data
- ✅ **Incremental**: Only loads new files automatically
- ✅ **Schema Evolution**: Can merge new columns with `mergeSchema` option
- ✅ **Atomic**: Either succeeds completely or rolls back

**Syntax:**
```sql
COPY INTO <table_name>
FROM '<source_path>'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS('mergeSchema' = 'true')
```

📚 **Learn More:**
- [COPY INTO Documentation](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-copy-into)
- [COPY INTO Examples](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/copy-into/)
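
The idempotent and incremental behavior described above can be pictured with a toy model: a table that remembers which source files it has already loaded. This is a conceptual sketch in plain Python only; the class `FileLedgerTable` is hypothetical and does not reflect how Databricks tracks file state internally.

```python
class FileLedgerTable:
    """Toy stand-in for a table that remembers which source files it has loaded."""

    def __init__(self):
        self.rows = []
        self.loaded_files = set()

    def copy_into(self, files, force=False):
        """Load rows from files (name -> list of rows); skip known files unless force is set."""
        loaded = 0
        for name, rows in files.items():
            if name in self.loaded_files and not force:
                continue
            self.rows.extend(rows)
            self.loaded_files.add(name)
            loaded += len(rows)
        return loaded

table = FileLedgerTable()
files = {"claims_01.csv": [("C1", 100.0), ("C2", 50.0)]}

print(table.copy_into(files))              # first run loads 2 rows
print(table.copy_into(files))              # second run loads 0 rows
files["claims_02.csv"] = [("C3", 75.0)]
print(table.copy_into(files))              # only the new file: 1 row
print(table.copy_into(files, force=True))  # force reloads everything: 3 rows
print(len(table.rows))                     # 6 rows, 3 of them duplicates
```

The last two lines show why a `force` option is best kept out of scheduled jobs: it trades idempotency for a full reload.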


### Loading Data with SQL

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.claims_raw;
COPY INTO payer_bronze.claims_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/claims/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'true');

-- NOTE: 'force' = 'true' is used here for demo purposes only. It reloads every file on each run, so re-running this cell appends the same rows again. In production, omit this option so COPY INTO only processes new data files.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.diagnosis_raw;
COPY INTO payer_bronze.diagnosis_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/diagnosis/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.members_raw;
COPY INTO payer_bronze.members_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/members/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.procedures_raw;
COPY INTO payer_bronze.procedures_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/procedures/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.providers_raw;
COPY INTO payer_bronze.providers_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/providers/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');
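
The five cells above differ only in the table name and the source folder, so the statements could also be generated from a small mapping. The sketch below only builds the SQL strings; the helper name `copy_into_sql` is an assumption, and the statement shape follows the syntax block shown earlier.

```python
def copy_into_sql(table, source_path, force=False):
    """Return a COPY INTO statement for a headered CSV folder (string only; nothing is executed)."""
    copy_options = ["'mergeSchema' = 'true'"]
    if force:
        copy_options.append("'force' = 'true'")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        "FILEFORMAT = CSV\n"
        "FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')\n"
        f"COPY_OPTIONS ({', '.join(copy_options)})"
    )

# Table name -> folder name, as used in the cells above.
sources = {
    "claims_raw": "claims",
    "diagnosis_raw": "diagnosis",
    "members_raw": "members",
    "procedures_raw": "procedures",
    "providers_raw": "providers",
}
base = "/Volumes/my_catalog/payer_bronze/payer/files"

statements = [
    copy_into_sql(f"payer_bronze.{table}", f"{base}/{folder}/")
    for table, folder in sources.items()
]
print(statements[0])
# In a Databricks notebook, each statement could then be run with spark.sql(statement),
# after the corresponding CREATE TABLE IF NOT EXISTS.
```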


### 🐍 Alternative: Loading Data with PySpark

While SQL is great for batch loading, PySpark gives you more programmatic control. Here's how to load the same data using PySpark:

In [0]:
# Example: Load data using PySpark
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Option 1: Let Spark infer the schema
claims_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/Volumes/my_catalog/payer_bronze/payer/files/claims/")

# Display first 10 rows
display(claims_df.limit(10))

# Show schema
print("Claims Schema:")
claims_df.printSchema()

# Get row count
print(f"\nTotal rows loaded: {claims_df.count()}")

# Write to Delta table (this creates or replaces the table)
# claims_df.write \
#     .format("delta") \
#     .mode("overwrite") \
#     .saveAsTable("payer_bronze.claims_raw_pyspark")
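
Schema inference requires an extra pass over the files and can guess types differently from run to run. A second option is to supply the schema explicitly. The sketch below builds a DDL-style schema string in plain Python; the column names come from the key-columns list earlier in this guide, while the types are assumptions that should be checked against the real files.

```python
# Assumed column types for the claims file; adjust them to match the real data.
claims_columns = {
    "claim_id": "STRING",
    "member_id": "STRING",
    "provider_id": "STRING",
    "claim_date": "DATE",
    "total_charge": "DOUBLE",
    "claim_status": "STRING",
}

def to_ddl(columns):
    """Turn a {column: type} mapping into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())

claims_ddl = to_ddl(claims_columns)
print(claims_ddl)

# In a Databricks notebook the string could be passed to the reader, for example:
# claims_df = (spark.read.format("csv")
#              .option("header", "true")
#              .schema(claims_ddl)
#              .load("/Volumes/my_catalog/payer_bronze/payer/files/claims/"))
```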


### 🎯 Exercise: Query Bronze Tables

Now that we've loaded data into Bronze tables, let's explore what we have:

**Try these queries yourself:**


In [0]:
# Query: Count records in each bronze table using PySpark
from pyspark.sql.functions import lit, count

tables = ['claims_raw', 'members_raw', 'providers_raw', 'diagnosis_raw', 'procedures_raw']
row_counts = []

for table in tables:
    cnt = spark.table(f"payer_bronze.{table}").count()
    row_counts.append((table, cnt))
    
# Create DataFrame to display results
result_df = spark.createDataFrame(row_counts, ["table_name", "row_count"])
display(result_df)


# 🔧 Silver Layer – Transform, Clean, and Join

---

## What is the Silver Layer?

The **Silver Layer** is where we transform raw data into clean, validated, and enriched datasets. Here we:

- 🧹 **Clean**: Remove nulls, trim whitespace, fix data quality issues
- 🔄 **Transform**: Cast data types, standardize formats
- 🗑️ **Deduplicate**: Remove duplicate records based on business keys
- 🔍 **Validate**: Apply business rules and data quality checks
- 📊 **Enrich**: Join related tables, calculate derived columns

> **💡 Best Practice**: Silver tables should be "analytics-ready" – cleaned, validated, and properly typed!



## Step 1: Transform Bronze to Silver (SQL)

Let's clean and transform our Bronze tables. We'll demonstrate with multiple examples using both **SQL** and **PySpark**.

In [0]:
%sql
-- Create silver schema
CREATE SCHEMA IF NOT EXISTS payer_silver;


-- Members: select relevant fields, cast types, remove duplicates
CREATE OR REPLACE TABLE payer_silver.members AS
SELECT DISTINCT
  CAST(member_id AS STRING) AS member_id,
  TRIM(first_name) AS first_name,
  TRIM(last_name) AS last_name,
  CAST(birth_date AS DATE) AS birth_date,
  gender,
  plan_id,
  CAST(effective_date AS DATE) AS effective_date
FROM payer_bronze.members_raw
WHERE member_id IS NOT NULL;


-- Claims: remove duplicates, prepare data
CREATE OR REPLACE TABLE payer_silver.claims AS
SELECT DISTINCT
  claim_id,
  member_id,
  provider_id,
  CAST(claim_date AS DATE) AS claim_date,
  ROUND(total_charge, 2) AS total_charge,
  LOWER(claim_status) AS claim_status
FROM payer_bronze.claims_raw
WHERE claim_id IS NOT NULL AND total_charge > 0;


-- Providers: deduplicate
CREATE OR REPLACE TABLE payer_silver.providers AS
SELECT DISTINCT
  provider_id,
  npi,
  provider_name,
  specialty,
  address,
  city,
  state
FROM payer_bronze.providers_raw
WHERE provider_id IS NOT NULL;
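
One detail worth keeping in mind: `SELECT DISTINCT` removes rows that are identical in every selected column. It does not guarantee one row per business key. The plain-Python sketch below, using made-up rows, contrasts row-level and key-level de-duplication.

```python
# Three versions of the same claim; the last one differs in a non-key column.
rows = [
    ("C1", "M1", 100.0, "paid"),
    ("C1", "M1", 100.0, "paid"),     # exact duplicate: removed by DISTINCT
    ("C1", "M1", 100.0, "denied"),   # same claim_id, different status: kept by DISTINCT
]

# Row-level de-duplication, analogous to SELECT DISTINCT.
distinct_rows = set(rows)

# Key-level de-duplication: keep the last row seen for each claim_id.
by_key = {}
for row in rows:
    by_key[row[0]] = row

print(len(distinct_rows))  # 2
print(len(by_key))         # 1
```

If a table must hold exactly one row per key, a key-based rule is needed, together with a decision about which version of the row to keep.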


## Step 2: Transform with PySpark

Now let's see how to do the same transformations using PySpark. This approach is more flexible for complex business logic.

### Example: Transform Procedures Table with PySpark


In [0]:
from pyspark.sql.functions import col, trim, upper, round as spark_round, when, regexp_replace

# Read from Bronze
procedures_bronze = spark.table("payer_bronze.procedures_raw")

# Clean and cast the amount column
procedures_bronze_clean = procedures_bronze.withColumn(
    "amount_clean",
    regexp_replace(col("amount"), "[^0-9.]", "").cast("double")
)

# Apply transformations
procedures_silver = procedures_bronze_clean \
    .dropDuplicates(['claim_id', 'procedure_code']) \
    .filter(col("claim_id").isNotNull()) \
    .filter(col("amount_clean") > 0) \
    .select(
        col("claim_id"),
        upper(trim(col("procedure_code"))).alias("procedure_code"),
        trim(col("procedure_desc")).alias("procedure_desc"),
        spark_round(col("amount_clean"), 2).alias("amount"),
        when(col("amount_clean") < 100, "Low")
        .when(col("amount_clean") < 500, "Medium")
        .when(col("amount_clean") < 1000, "High")
        .otherwise("Very High").alias("cost_category")
    )

# Show sample data
print("Transformed Procedures (first 10 rows):")
display(procedures_silver.limit(10))

# Show statistics
print("\nCost Category Distribution:")
display(procedures_silver.groupBy("cost_category").count().orderBy("cost_category"))

# Write to Silver table
procedures_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_silver.procedures")
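
The cleaning and bucketing logic in this cell can be checked on a few values without Spark. The sketch below mirrors the regular expression and the threshold chain in plain Python; the function names and sample inputs are illustrative assumptions.

```python
import re

def clean_amount(raw):
    """Strip everything except digits and dots, then parse; return None if nothing usable remains."""
    digits = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        return None

def cost_category(amount):
    """Mirror the when/otherwise chain: the first matching threshold wins."""
    if amount < 100:
        return "Low"
    if amount < 500:
        return "Medium"
    if amount < 1000:
        return "High"
    return "Very High"

for raw in ["$1,250.00", "99.99", "100", "N/A"]:
    amount = clean_amount(raw)
    label = cost_category(amount) if amount is not None else None
    print(raw, amount, label)
```

Two properties are easy to miss: an amount of exactly 100 falls into "Medium" because the comparisons are strict, and the pattern also strips a leading minus sign, so a negative amount would come out positive.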

### 🎯 Exercise: Data Quality Checks

Let's verify our Silver transformations worked correctly:


In [0]:
from pyspark.sql.functions import col, sum as spark_sum

# Data Quality Check 1: Check for nulls in key columns
print("=== NULL CHECK ===")
members_df = spark.table("payer_silver.members")
null_counts = members_df.select([
    spark_sum(col(c).isNull().cast("int")).alias(c)
    for c in members_df.columns
])
display(null_counts)

# Data Quality Check 2: Check for duplicates
print("\n=== DUPLICATE CHECK ===")
claims_df = spark.table("payer_silver.claims")
total_rows = claims_df.count()
distinct_rows = claims_df.select("claim_id").distinct().count()
print(f"Total rows: {total_rows}")
print(f"Distinct claim_ids: {distinct_rows}")
print(f"Duplicates: {total_rows - distinct_rows}")

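# Data Quality Check 3: Value range checks
# A dict passed to agg() can hold only one entry per column, so use column expressions instead.
from pyspark.sql import functions as F

print("\n=== VALUE RANGE CHECK ===")
claims_stats = claims_df.agg(
    F.min("total_charge").alias("min_charge"),
    F.max("total_charge").alias("max_charge"),
    F.avg("total_charge").alias("avg_charge")
)
display(claims_stats)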
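
When an aggregation is specified as a Python dict (`df.agg({...})`), each column can map to only one function, because a dict keeps a single value per key. The short example below shows this in plain Python, independent of Spark.

```python
# A dict literal with a repeated key keeps only the last value.
spec = {"total_charge": "min", "total_charge": "max", "total_charge": "avg"}
print(spec)  # {'total_charge': 'avg'}

# One way to express several aggregates per column is a list of (column, function) pairs.
pairs = [("total_charge", "min"), ("total_charge", "max"), ("total_charge", "avg")]
print(len(spec), len(pairs))  # 1 3
```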

# ⭐ Gold Layer – Aggregate, Model, Ready for Analytics

---

## What is the Gold Layer?

The **Gold Layer** contains curated, business-level tables optimized for analytics and reporting. Here we:

- 📊 **Aggregate**: Create summary metrics and KPIs
- 🔗 **Join**: Combine related tables into wide, denormalized views
- 🎯 **Model**: Build fact tables, dimension tables, or star schemas
- 🚀 **Optimize**: Structure data for fast query performance
- 📈 **Business Logic**: Apply complex calculations and business rules

> **💡 Best Practice**: Gold tables should answer specific business questions directly!

---

## Common Gold Table Patterns

1. **Enriched Fact Tables**: Claims joined with members, providers, diagnoses
2. **Aggregate Summaries**: Member-level or provider-level rollups
3. **Time-Series Analytics**: Trends over time (daily, monthly, yearly)
4. **Dimensional Tables**: Lookup tables for reporting tools

---


## Step 1: Create Enriched Tables with SQL

In [0]:
%sql
-- Create gold schema
CREATE SCHEMA IF NOT EXISTS payer_gold;

-- Gold: Claims with member and provider details
CREATE OR REPLACE TABLE payer_gold.claims_enriched AS
SELECT
  c.claim_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  m.member_id,
  m.first_name,
  m.last_name,
  m.gender,
  m.plan_id,
  p.provider_id,
  p.provider_name,
  p.specialty,
  p.city,
  p.state
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id;


-- Gold: Claim Aggregates per Member
CREATE OR REPLACE TABLE payer_gold.member_claim_summary AS
SELECT
  member_id,
  COUNT(DISTINCT claim_id) AS total_claims,
  SUM(total_charge) AS sum_claims,
  MAX(total_charge) AS max_claim,
  MIN(total_charge) AS min_claim
FROM payer_silver.claims
GROUP BY member_id;


## Step 2: Create Advanced Analytics with PySpark

Let's create more sophisticated analytics tables using PySpark:

### Example 1: Provider Performance Dashboard

In [0]:
# Note: these imports shadow Python's built-in sum, min, and max for the rest of the notebook
from pyspark.sql.functions import col, sum, avg, count, min, max, stddev, countDistinct
from pyspark.sql.window import Window

# Load tables
claims = spark.table("payer_silver.claims")
providers = spark.table("payer_silver.providers")
members = spark.table("payer_silver.members")

# Create provider performance metrics
provider_performance = claims.join(providers, "provider_id") \
    .groupBy(
        "provider_id", 
        "provider_name", 
        "specialty", 
        "city", 
        "state"
    ) \
    .agg(
        count("claim_id").alias("total_claims"),
        countDistinct("member_id").alias("unique_patients"),
        sum("total_charge").alias("total_revenue"),
        avg("total_charge").alias("avg_claim_amount"),
        min("total_charge").alias("min_claim_amount"),
        max("total_charge").alias("max_claim_amount"),
        stddev("total_charge").alias("stddev_claim_amount")
    ) \
    .orderBy(col("total_revenue").desc())

print("Top 10 Providers by Revenue:")
display(provider_performance.limit(10))

# Save to Gold table
provider_performance.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.provider_performance")


### Example 2: Time-Series Analysis - Monthly Claims Trends


In [0]:
from pyspark.sql.functions import col, count, sum, avg, year, month, date_format, lag, round as spark_round
from pyspark.sql.window import Window

# Create monthly trends
monthly_claims = claims \
    .withColumn("year", year("claim_date")) \
    .withColumn("month", month("claim_date")) \
    .withColumn("month_year", date_format("claim_date", "yyyy-MM")) \
    .groupBy("year", "month", "month_year", "claim_status") \
    .agg(
        count("claim_id").alias("claim_count"),
        sum("total_charge").alias("total_charges"),
        avg("total_charge").alias("avg_charge")
    ) \
    .orderBy("year", "month", "claim_status")

# Calculate month-over-month growth
window_spec = Window.partitionBy("claim_status").orderBy("year", "month")

monthly_trends = monthly_claims \
    .withColumn("prev_month_charges", lag("total_charges").over(window_spec)) \
    .withColumn(
        "mom_growth_pct", 
        spark_round(
            ((col("total_charges") - col("prev_month_charges")) / col("prev_month_charges") * 100), 
            2
        )
    )

print("Monthly Claims Trends with Growth:")
display(monthly_trends)

# Save to Gold
monthly_trends.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.monthly_claims_trends")
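
The month-over-month formula is `(current - previous) / previous * 100`. The sketch below reproduces it in plain Python for a single series, using hypothetical totals, so the arithmetic can be verified by hand.

```python
def mom_growth(monthly_totals):
    """Return (month, total, growth_pct) tuples; growth is None when no valid previous month exists."""
    results = []
    previous = None
    for month, total in sorted(monthly_totals.items()):
        if previous in (None, 0):
            growth = None
        else:
            growth = round((total - previous) / previous * 100, 2)
        results.append((month, total, growth))
        previous = total
    return results

# Hypothetical monthly charge totals for one claim status.
totals = {"2024-01": 1000.0, "2024-02": 1250.0, "2024-03": 1000.0}
for row in mom_growth(totals):
    print(row)
```

As with `lag`, each month is compared with the previous available row, so a gap in the calendar (a month with no claims) is not detected. The first month has no predecessor and therefore no growth value.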


# 🤖 Using Databricks AI Assistant

---

Databricks AI Assistant can help you write code, understand data, and troubleshoot issues!

### How to Use AI Assistant:
1. Click the AI Assistant icon
2. Ask questions in natural language
3. Get code suggestions and explanations

### Example Prompts to Try:
- "What kind of aggregations can I do with table payer_gold.claims_enriched?"
- "How do I calculate the total claims by specialty?"
- "Show me how to create a window function for running totals"
- "What does spark.table() command do?"
- "Help me debug this PySpark error"

---



# 📊 Analytics & Visualization

---

## Introduction to Databricks Visualizations

Databricks provides powerful built-in visualization capabilities. You can create:
- 📊 Bar charts, line charts, scatter plots
- 🗺️ Geographic maps
- 📈 Histograms and box plots
- 🥧 Pie charts and area charts

> **💡 Tip**: Use `display()` function to automatically generate interactive visualizations!

---

## Statistical Analysis with PySpark

Let's explore our data using statistical analysis and visualizations:

In [0]:
display(spark.table("payer_gold.claims_enriched"))

## Example Visualizations


### 1. Claims by Status (Bar Chart)

In [0]:
from pyspark.sql.functions import sum, count, avg

# Aggregate total charges by claim status
claims_by_status = spark.table("payer_gold.claims_enriched") \
    .groupBy("claim_status") \
    .agg(
        sum("total_charge").alias("total_charges"),
        count("claim_id").alias("claim_count"),
        avg("total_charge").alias("avg_charge")
    ) \
    .orderBy("total_charges", ascending=False)

# Visualize as a bar chart: X-axis = claim_status, Y-axis = total_charges
display(claims_by_status)

# Tip: In the visualization options, select "Bar Chart", set X-axis to 'claim_status', Y-axis to 'total_charges'

Databricks visualization. Run in Databricks to view.

### 2. Gender Distribution (Pie Chart)


In [0]:
# Analyze claims distribution by gender
gender_analysis = spark.table("payer_gold.claims_enriched") \
    .groupBy("gender") \
    .agg(
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges"),
        avg("total_charge").alias("avg_charge_per_claim")
    )

print("👥 Claims Analysis by Gender:")
display(gender_analysis)

# Tip: Try changing the visualization to a Pie Chart to see the distribution!

Databricks visualization. Run in Databricks to view.

### 3. Time Series - Claims Over Time (Line Chart)


In [0]:
# Time series analysis of claims
time_series = spark.table("payer_gold.claims_enriched") \
    .groupBy("claim_date") \
    .agg(
        sum("total_charge").alias("daily_charges"),
        count("claim_id").alias("daily_claim_count")
    ) \
    .orderBy("claim_date")

print("📈 Daily Claims Trends:")
display(time_series)

# Tip: Select Line Chart visualization with claim_date as X-axis and daily_charges as Y-axis!

Databricks visualization. Run in Databricks to view.

### 4. Geographic Analysis - Claims by City (Map/Bar Chart)


In [0]:
from pyspark.sql.functions import count, countDistinct, sum

# Geographic distribution of claims
city_analysis = spark.table("payer_gold.claims_enriched") \
    .groupBy("city", "state") \
    .agg(
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges"),
        countDistinct("member_id").alias("unique_members")
    ) \
    .orderBy("total_charges", ascending=False)

print("🗺️ Claims Distribution by City:")
display(city_analysis.limit(20))

# Tip: Try Map visualization if your data has proper geographic fields!

Databricks visualization. Run in Databricks to view.


### 5. Distribution Analysis - Charge Amount Histogram


In [0]:
# Analyze the distribution of claim charges
charges = spark.table("payer_gold.claims_enriched").select("total_charge")

# Get statistical summary
print("📊 Statistical Summary of Claim Charges:")
charges.describe().show()

# Display histogram
print("\n💵 Charge Distribution (Histogram):")
display(charges)

# Tip: Select Histogram visualization to see the distribution of charges!
# You can adjust the number of bins for better granularity.

Databricks visualization. Run in Databricks to view.

### 6. Correlation Analysis - Scatter Plot


In [0]:
# Scatter plot to find relationships between variables
scatter_data = spark.table("payer_gold.claims_enriched") \
    .select("claim_date", "total_charge", "claim_status")

print("📍 Scatter Plot: Claim Date vs Charge Amount")
display(scatter_data)

# Tip: Select Scatter Plot visualization
# X-axis: claim_date, Y-axis: total_charge
# Group by: claim_status for color coding

Databricks visualization. Run in Databricks to view.



## 🎯 Advanced Analytics Example: Cohort Analysis

Let's create a more complex analysis - member cohort analysis based on their enrollment date:


In [0]:
from pyspark.sql.functions import col, count, countDistinct, date_format, floor, months_between, sum

# Join members with their claims
member_claims = spark.table("payer_silver.members") \
    .join(spark.table("payer_silver.claims"), "member_id", "left")

# Create cohorts based on enrollment month
cohort_analysis = member_claims \
    .withColumn("enrollment_month", date_format("effective_date", "yyyy-MM")) \
    .withColumn("claim_month", date_format("claim_date", "yyyy-MM")) \
    .withColumn(
        "months_since_enrollment", 
        floor(months_between("claim_date", "effective_date"))
    ) \
    .groupBy("enrollment_month", "months_since_enrollment") \
    .agg(
        countDistinct("member_id").alias("active_members"),
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges")
    ) \
    .filter(col("months_since_enrollment").isNotNull()) \
    .orderBy("enrollment_month", "months_since_enrollment")

print("📅 Member Cohort Analysis:")
display(cohort_analysis)

# Save as a Gold table for future analysis
cohort_analysis.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.member_cohort_analysis")
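
To make the `months_since_enrollment` column easier to reason about, here is a plain-Python sketch of what `floor(months_between(claim_date, effective_date))` evaluates to for date-only values. It assumes Spark's documented rule (whole months when both dates share a day of month or are both month-ends, otherwise a remainder measured in 31-day months). The helper names are hypothetical and not part of the workshop code.

```python
import calendar
import math
from datetime import date


def _is_last_day(d: date) -> bool:
    """True when d falls on the final day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_since_enrollment(claim_date: date, effective_date: date) -> int:
    """Approximate floor(months_between(claim_date, effective_date)) for plain dates."""
    whole_months = (claim_date.year - effective_date.year) * 12 + (
        claim_date.month - effective_date.month
    )
    same_day = claim_date.day == effective_date.day
    both_month_end = _is_last_day(claim_date) and _is_last_day(effective_date)
    if same_day or both_month_end:
        return whole_months
    # Assumed Spark behavior: the leftover days are expressed in 31-day months
    fraction = (claim_date.day - effective_date.day) / 31
    return math.floor(whole_months + fraction)


print(months_since_enrollment(date(2024, 3, 15), date(2024, 1, 20)))  # 1
print(months_since_enrollment(date(2024, 2, 29), date(2024, 1, 31)))  # 1
print(months_since_enrollment(date(2024, 1, 5), date(2024, 1, 20)))   # -1
```

The last call shows that a claim dated before the member's effective date produces a negative month offset. The `isNotNull()` filter in the cell above would keep such rows, so you may want an additional `>= 0` filter if they appear in your data.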


# AI/BI

Intelligent analytics for everyone!

Databricks AI/BI is a new type of business intelligence product designed to provide a deep understanding of your data's semantics, enabling self-service data analysis for everyone in your organization. AI/BI is built on a compound AI system that draws insights from the full lifecycle of your data across the Databricks platform, including ETL pipelines, lineage, and other queries.

<img src="https://www.databricks.com/sites/default/files/2025-05/hero-image-ai-bi-v2-2x.png?v=1748417271" alt="Databricks AI/BI" width="600" height="500">

# Genie

Talk with your data

Now everyone can get insights from data simply by asking questions in natural language.

<img src="https://www.databricks.com/sites/default/files/2025-06/ai-bi-genie-hero.png?v=1749162682" alt="AI/BI Genie" width="600" height="500">


# 📚 Best Practices & Performance Tips

## 🚀 Performance Optimization

### 1. **Use Partitioning for Large Tables**
```python
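# Partition by date for time-series data.
# Worthwhile mainly for very large tables; a coarser column (for example a
# claim month) avoids creating many tiny partitions.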
df.write \
    .format("delta") \
    .partitionBy("claim_date") \
    .saveAsTable("payer_gold.claims_partitioned")
```

### 2. **Enable Z-Ordering for Common Filters**
```sql
OPTIMIZE payer_gold.claims_enriched
ZORDER BY (member_id, claim_date);
```

### 3. **Use Caching for Frequently Accessed DataFrames**
```python
claims_df = spark.table("payer_silver.claims").cache()
# cache() is lazy: the first action materializes the data,
# and later uses of claims_df reuse it without re-reading the table
```

### 4. **Broadcast Small Tables in Joins**
```python
from pyspark.sql.functions import broadcast

large_df.join(broadcast(small_df), "key")
```

---

## 🔒 Data Quality Best Practices

### 1. **Always Validate Data**
```python
# Add constraints
spark.sql("""
    ALTER TABLE payer_silver.claims 
    ADD CONSTRAINT valid_charge CHECK (total_charge > 0)
""")
```

### 2. **Use Schema Evolution Carefully**
```python
# Explicitly define schema for production
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("claim_id", StringType(), False),
    StructField("total_charge", DoubleType(), True),
    # ... more fields
])
```

### 3. **Implement Data Quality Checks**
```python
from pyspark.sql.functions import col, current_date

def validate_claims(df):
    """Run data quality checks"""
    checks = {
        "null_claim_ids": df.filter(col("claim_id").isNull()).count(),
        "negative_charges": df.filter(col("total_charge") < 0).count(),
        "future_dates": df.filter(col("claim_date") > current_date()).count()
    }
    return checks
```
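
A dictionary of counts only helps if something acts on it. The sketch below shows one possible way to turn the output of `validate_claims` into a pass/fail gate. The helper names `failed_checks` and `enforce_quality` and the `tolerated` thresholds are hypothetical, and the sample counts are hand-written rather than real results.

```python
from typing import Dict, Optional


def failed_checks(
    checks: Dict[str, int], tolerated: Optional[Dict[str, int]] = None
) -> Dict[str, int]:
    """Return only the checks whose violation count exceeds its tolerated limit."""
    tolerated = tolerated or {}
    return {
        name: count
        for name, count in checks.items()
        if count > tolerated.get(name, 0)  # default tolerance is zero violations
    }


def enforce_quality(
    checks: Dict[str, int], tolerated: Optional[Dict[str, int]] = None
) -> None:
    """Raise ValueError when any check exceeds its tolerated limit."""
    failures = failed_checks(checks, tolerated)
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")


# Hand-written counts in the shape validate_claims returns (not real results)
sample = {"null_claim_ids": 0, "negative_charges": 3, "future_dates": 1}
print(failed_checks(sample, tolerated={"future_dates": 5}))
# {'negative_charges': 3}
```

In a pipeline you might call `enforce_quality(validate_claims(df))` before writing to the next layer, so that a failing check stops the run instead of silently propagating bad rows.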

---

## 💾 Delta Lake Best Practices

### 1. **Regular Maintenance**
```sql
-- Compact small files
OPTIMIZE payer_gold.claims_enriched;

-- Remove old versions (keep 7 days)
VACUUM payer_gold.claims_enriched RETAIN 168 HOURS;

-- Update statistics
ANALYZE TABLE payer_gold.claims_enriched COMPUTE STATISTICS;
```
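
If you maintain several tables, the statements above can be generated rather than typed by hand. This is a minimal sketch: the function name `maintenance_statements` is hypothetical, and the seven-day floor is a conservative assumption based on Delta Lake's default retention safety check.

```python
from typing import List


def maintenance_statements(table: str, retention_days: int = 7) -> List[str]:
    """Build OPTIMIZE, VACUUM, and ANALYZE statements for one table."""
    if retention_days < 7:
        # Assumption: stay at or above the default 7-day retention threshold
        raise ValueError("retention_days should be at least 7")
    retention_hours = retention_days * 24  # 7 days -> 168 hours
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retention_hours} HOURS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
    ]


for statement in maintenance_statements("payer_gold.claims_enriched"):
    # In a notebook you might pass each string to spark.sql(statement)
    print(statement)
```

Where predictive optimization is enabled for managed tables, some of this maintenance may already be handled for you, so check before scheduling it yourself.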

### 2. **Use Time Travel for Auditing**
```sql
-- Query previous version
SELECT * FROM payer_gold.claims_enriched VERSION AS OF 1;

-- Query as of timestamp
SELECT * FROM payer_gold.claims_enriched TIMESTAMP AS OF '2025-01-01';
```

### 3. **Enable Change Data Feed**
```sql
ALTER TABLE payer_gold.claims_enriched 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

---

## 🏗️ Architecture Best Practices

### 1. **Medallion Layer Guidelines**
- **Bronze**: Keep all source data, minimal transformation
- **Silver**: One source system = one silver table (usually)
- **Gold**: Many silver tables → one gold table (join/aggregate)

### 2. **Naming Conventions**
```
Catalog: <organization>_<environment>
Schema: <domain>_<layer>
Table: <entity>_<descriptor>

Examples:
- acme_prod.payer_bronze.claims_raw
- acme_dev.payer_silver.claims_cleaned
- acme_prod.payer_gold.member_360_view
```
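
One way to keep names consistent is to build them in code instead of typing them. The sketch below assembles a three-level name from the convention above. The function name `qualified_table_name` and the restriction to the three medallion layers are assumptions for illustration.

```python
ALLOWED_LAYERS = {"bronze", "silver", "gold"}


def qualified_table_name(
    organization: str,
    environment: str,
    domain: str,
    layer: str,
    entity: str,
    descriptor: str,
) -> str:
    """Return <organization>_<environment>.<domain>_<layer>.<entity>_<descriptor>."""
    if layer not in ALLOWED_LAYERS:
        raise ValueError(f"Unknown layer: {layer!r}")
    catalog = f"{organization}_{environment}"
    schema = f"{domain}_{layer}"
    table = f"{entity}_{descriptor}"
    return f"{catalog}.{schema}.{table}"


print(qualified_table_name("acme", "prod", "payer", "bronze", "claims", "raw"))
# acme_prod.payer_bronze.claims_raw
```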

### 3. **Documentation**
```sql
-- Add table comments
COMMENT ON TABLE payer_gold.claims_enriched IS 
'Enriched claims with member and provider details for analytics';

-- Add column comments
ALTER TABLE payer_gold.claims_enriched 
ALTER COLUMN total_charge COMMENT 'Total charged amount in USD';
```

---


# 🎓 Workshop Summary & Next Steps

## 🎉 Congratulations!

You've completed the Databricks Healthcare Payer Analytics Workshop! Let's review what you learned:

---

## ✅ What You Accomplished

### 1. **Medallion Architecture**
- ✅ Built a complete **Bronze → Silver → Gold** pipeline
- ✅ Understood data quality improvement at each layer
- ✅ Created analytics-ready datasets

### 2. **Data Engineering Skills**
- ✅ Loaded data using **COPY INTO**
- ✅ Transformed data with **SQL and PySpark**
- ✅ Applied data quality checks and validations
- ✅ Created aggregations and derived metrics

### 3. **Analytics & Visualization**
- ✅ Generated business insights from data
- ✅ Created interactive visualizations
- ✅ Performed statistical analysis
- ✅ Built executive dashboards

### 4. **Databricks Platform**
- ✅ Worked with **Unity Catalog**
- ✅ Used **Delta Lake** for reliable data storage
- ✅ Leveraged **AI Assistant** for code help
- ✅ Applied performance optimization techniques

---

## 🚀 Next Steps

### Immediate Actions
1. ⭐ **Bookmark this notebook** for future reference
2. 📖 Complete the hands-on exercises
3. 🔄 Try modifying the code with your own logic
4. 💾 Export your results and share with your team

### Continue Learning

#### 📚 Advanced Topics to Explore
- **Delta Live Tables (DLT)**: Declarative pipeline framework
- **Databricks Workflows**: Orchestration and scheduling
- **Unity Catalog**: Advanced governance features
- **Databricks SQL**: Performance tuning and optimization
- **Machine Learning**: MLflow and Feature Store
- **Streaming**: Structured Streaming with Delta Lake

#### 🔗 Helpful Resources
- [Databricks Documentation](https://docs.databricks.com/)
- [Delta Lake Guide](https://docs.delta.io/)
- [Databricks Academy](https://www.databricks.com/learn/training)
- [Community Forums](https://community.databricks.com/)
- [Databricks Blog](https://www.databricks.com/blog)

#### 🎯 Recommended Certifications
- **Databricks Lakehouse Platform Fundamentals**
- **Databricks Certified Data Engineer Associate**
- **Databricks Certified Data Analyst Associate**

---

## 💡 Real-World Applications

Apply these skills to:
- 🏥 **Healthcare**: Claims processing, patient analytics, risk scoring
- 🏦 **Finance**: Fraud detection, transaction analysis, risk management
- 🛒 **Retail**: Customer analytics, inventory optimization, sales forecasting
- 📱 **Technology**: User behavior analysis, product metrics, A/B testing

---

## 🤝 Get Help & Share

### Need Help?
- 💬 Ask the **Databricks AI Assistant**
- 🌐 Visit [Databricks Community](https://community.databricks.com/)
- 📧 Contact your Databricks account team
- 📖 Check [Stack Overflow](https://stackoverflow.com/questions/tagged/databricks)

### Share Your Success
- ⭐ Share insights with your colleagues
- 📊 Create dashboards for stakeholders
- 🎤 Present your work at team meetings
- 🏆 Contribute to the Databricks community

---

## 📝 Feedback

We'd love to hear your thoughts on this workshop!

**What worked well?** **What could be improved?** **What topics do you want to learn next?**

---

## 🙏 Thank You!

Thank you for participating in this workshop. We hope you found it valuable and are excited to continue your Databricks journey! 🚀

---

*Last Updated: October 7, 2025*


# 📖 Quick Reference Guide

## Common PySpark Operations

### Reading Data
```python
# From Delta table
df = spark.table("catalog.schema.table")

# From CSV
df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")

# From JSON
df = spark.read.json("path/to/file.json")

# From Parquet
df = spark.read.parquet("path/to/file.parquet")
```

### Writing Data
```python
# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("table_name")

# Append mode
df.write.format("delta").mode("append").saveAsTable("table_name")

# With partitioning
df.write.format("delta").partitionBy("date_col").saveAsTable("table_name")
```

### Common Transformations
```python
from pyspark.sql.functions import *

# Select columns
df.select("col1", "col2")

# Filter rows
df.filter(col("amount") > 100)
df.where("amount > 100")

# Add new column
df.withColumn("new_col", col("old_col") * 2)

# Rename column
df.withColumnRenamed("old_name", "new_name")

# Drop column
df.drop("col_name")

# Group by and aggregate
df.groupBy("category").agg(
    count("*").alias("count"),
    sum("amount").alias("total"),
    avg("amount").alias("average")
)

# Join tables
df1.join(df2, "key_column")
df1.join(df2, df1.key == df2.key, "left")

# Sort
df.orderBy("col_name")
df.orderBy(col("col_name").desc())

# Remove duplicates
df.dropDuplicates()
df.dropDuplicates(["col1", "col2"])
```

### Common Functions
```python
# String functions
upper("col_name")
lower("col_name")
trim("col_name")
concat("col1", "col2")
substring("col_name", 1, 5)

# Date functions
current_date()
current_timestamp()
date_format("date_col", "yyyy-MM-dd")
year("date_col")
month("date_col")
datediff("date1", "date2")

# Math functions
round("col_name", 2)
abs("col_name")
ceil("col_name")
floor("col_name")

# Conditional logic
when(col("amount") > 100, "High").otherwise("Low")

# Null handling
col("col_name").isNull()
col("col_name").isNotNull()
coalesce("col1", "col2", lit(0))
```

## Common SQL Operations

### DDL Commands
```sql
-- Create database
CREATE DATABASE IF NOT EXISTS database_name;

-- Drop database
DROP DATABASE IF EXISTS database_name CASCADE;

-- Create table
CREATE TABLE table_name (
    id STRING,
    amount DOUBLE,
    date DATE
);

-- Drop table
DROP TABLE IF EXISTS table_name;

-- Describe table
DESCRIBE EXTENDED table_name;
SHOW COLUMNS FROM table_name;
```

### DML Commands
```sql
-- Insert data (values follow the id STRING, amount DOUBLE, date DATE columns above)
INSERT INTO table_name VALUES ('1', 100.0, DATE'2025-01-01');

-- Update data (Delta Lake)
UPDATE table_name SET amount = 200 WHERE id = '1';

-- Delete data (Delta Lake)
DELETE FROM table_name WHERE id = '1';

-- Merge (Upsert)
MERGE INTO target_table
USING source_table
ON target_table.id = source_table.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
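
If the upsert behavior of `MERGE` is new to you, the following plain-Python sketch mimics the two clauses above on small in-memory rows. It is only an illustration of the semantics, not how Delta Lake executes a merge. The function name `merge_upsert` is hypothetical, and it assumes each key appears at most once in the source and that both sides share the same columns.

```python
from typing import Dict, List


def merge_upsert(target: List[dict], source: List[dict], key: str = "id") -> List[dict]:
    """Mimic WHEN MATCHED THEN UPDATE SET * / WHEN NOT MATCHED THEN INSERT *."""
    merged: Dict[object, dict] = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: the source row replaces the target row.
        # Unmatched key: the source row is inserted.
        merged[row[key]] = dict(row)
    return list(merged.values())


target_rows = [{"id": 1, "amount": 100}, {"id": 2, "amount": 50}]
source_rows = [{"id": 2, "amount": 75}, {"id": 3, "amount": 20}]

for row in merge_upsert(target_rows, source_rows):
    print(row)
# {'id': 1, 'amount': 100}
# {'id': 2, 'amount': 75}
# {'id': 3, 'amount': 20}
```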

### Query Commands
```sql
-- Basic SELECT
SELECT * FROM table_name LIMIT 10;

-- With WHERE clause
SELECT * FROM table_name WHERE amount > 100;

-- Aggregations
SELECT category, COUNT(*), SUM(amount), AVG(amount)
FROM table_name
GROUP BY category;

-- Joins
SELECT a.*, b.name
FROM table_a a
INNER JOIN table_b b ON a.id = b.id;

-- Window functions
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) as row_num
FROM table_name;

-- CTE (Common Table Expression)
WITH summary AS (
    SELECT category, SUM(amount) as total
    FROM table_name
    GROUP BY category
)
SELECT * FROM summary WHERE total > 1000;
```

## Databricks Utilities
```python
# File system operations
dbutils.fs.ls("path/")
dbutils.fs.cp("source", "destination")
dbutils.fs.rm("path/", recurse=True)
dbutils.fs.mkdirs("path/")

# Widgets (parameters)
dbutils.widgets.text("param_name", "default_value")
param_value = dbutils.widgets.get("param_name")

# Notebooks
dbutils.notebook.run("notebook_path", 60, {"param": "value"})  # 60 = timeout in seconds
```

---

*Keep this reference handy as you build your data pipelines!*
