# DBSQL Databricks Workshop Hands-on
**Intro to Databricks ETL and Analytics**

ver. September 8, 2025

# Introduction
This guide shows how to design a **Databricks Medallion Architecture** (Bronze, Silver, Gold) pipeline using sample tables and realistic transformations relevant to a **Healthcare Payer**. The code runs in Databricks notebooks and mixes SQL cells with Python cells.


# What is a lakehouse?

1. **Hybrid Architecture:**  
   A lakehouse combines the best of data lakes (flexible, cheap storage) and data warehouses (structured, fast analytics), providing transactional and governance features on top of open cloud storage.

2. **ACID Transactions and Schema Governance:**  
   Lakehouses support ACID transactions for consistent concurrent data access and enforce schema management, which is essential for data integrity and compliance.

3. **Open and Decoupled:**  
   They use open file formats (like Parquet), decouple compute from storage for flexible scalability, and allow access by a variety of analytics, BI, and machine learning tools.

4. **Supports All Workloads and Data Types:**  
   The architecture enables SQL analytics, data science, machine learning, and can handle structured, semi-structured, and unstructured data (including images, text, video).

5. **Single Platform, Enterprise Ready:**  
   With features like real-time streaming, end-to-end governance, access control, and data discovery tools, lakehouses reduce complexity—allowing enterprises to manage all data and analytics needs in one unified system.
![](https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new.png)

## Unity Catalog

[Unified and open governance for data and AI in the Lakehouse](https://www.databricks.com/product/unity-catalog#features)

Eliminate silos, simplify governance and accelerate insights at scale:

- Centralizes governance, access control, auditing, and data discovery for all data and AI assets across Databricks workspaces.
- Enables fine-grained, consistent data access policies (row- and column-level), defined once and applied everywhere.
- Provides comprehensive data lineage and audit logs, showing how and by whom data is accessed and transformed.
- Supports data discovery, tagging, and documentation, making it easier to find and understand datasets and models.
- Works across multiple clouds and supports open formats (Delta, Parquet, etc.), avoiding vendor lock-in and enabling broad interoperability.
- Allows secure data and AI sharing within and outside the organization, including clean rooms and partner collaborations.
- Provides built-in monitoring for data quality, freshness, and usage, helping ensure compliance and rapid troubleshooting.
- Integrates tightly with the catalog/schema/object model, enhancing organization and security for all managed data assets.

![](https://www.databricks.com/sites/default/files/2025-05/header-unity-catalog.png?v=1748513086)

[Unity Catalog Search & Data Explorer](https://app.getreprise.com/launch/96mpAqy/)

[Exploring Lineage and Governance with Unity Catalog](https://app.getreprise.com/launch/MnqjQDX/)

[A Comprehensive Guide to Data and AI Governance](https://www.databricks.com/sites/default/files/2024-08/comprehensive-guide-to-data-and-ai-governance.pdf)






## Medallion lakehouse architecture

In this example, we will be following the **medallion lakehouse architecture**. The medallion architecture is a data design pattern to organize data in a lakehouse. The goal is to progressively improve the quality and structure of the data as it flows through each layer (Bronze [**raw**] → Silver [**staging**] → Gold [**main**]).

1. **Bronze layer**: the raw, unvalidated data
2. **Silver**: cleansed and conformed data
3. **Gold**: curated business-level tables
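As a rough illustration of the three layers, the sketch below pushes a few made-up claim records through bronze, silver, and gold steps in plain Python. The records and the function names (`to_silver`, `to_gold`) are assumptions for illustration only; the pipeline in this workshop is built with SQL on Delta tables.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (strings, duplicates, bad rows).
bronze = [
    {"claim_id": "C1", "member_id": "M1", "total_charge": "120.00", "claim_status": "PAID"},
    {"claim_id": "C1", "member_id": "M1", "total_charge": "120.00", "claim_status": "PAID"},  # duplicate
    {"claim_id": "C2", "member_id": "M1", "total_charge": "80.50", "claim_status": "Denied"},
    {"claim_id": None, "member_id": "M2", "total_charge": "50.00", "claim_status": "paid"},  # missing key
]


def to_silver(rows):
    """Cleanse and conform: drop rows without a key, cast types, deduplicate."""
    seen = set()
    silver_rows = []
    for row in rows:
        if row["claim_id"] is None or row["claim_id"] in seen:
            continue
        seen.add(row["claim_id"])
        silver_rows.append({
            "claim_id": row["claim_id"],
            "member_id": row["member_id"],
            "total_charge": float(row["total_charge"]),
            "claim_status": row["claim_status"].lower(),
        })
    return silver_rows


def to_gold(rows):
    """Curate a business-level table: total charge per member."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["member_id"]] += row["total_charge"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Each step only reads from the layer before it, which is what lets quality improve progressively from raw to curated data.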

<img src="https://www.databricks.com/sites/default/files/inline-images/building-data-pipelines-with-delta-lake-120823.png?v=1702318922" alt="Medallion architecture" width="600" height="500">

## Managed tables

[How Unity Catalog Managed Tables Automate Performance at Scale](https://www.databricks.com/blog/how-unity-catalog-managed-tables-automate-performance-scale) with [Predictive Optimization](https://learn.microsoft.com/en-us/azure/databricks/optimizations/predictive-optimization)



<img src="https://www.databricks.com/sites/default/files/inline-images/image2_48.png?v=1751297384" alt="Managed Tables" width="600" height="500">


[Faster Queries: 20X query latency reduction](https://www.databricks.com/blog/predictive-optimization-automatically-delivers-faster-queries-and-lower-tco)

**Predictive Optimization** in Databricks automates table management by leveraging Unity Catalog and the Data Intelligence Platform. This innovative feature currently runs the following optimizations for Unity Catalog managed tables:

* **OPTIMIZE** - Triggers incremental clustering for enabled tables. Improves query performance by optimizing file sizes.
* **VACUUM** - Reduces storage costs by deleting data files no longer referenced by the table.
* **ANALYZE** - Triggers incremental update of statistics to improve query performance. 


<img src="https://www.databricks.com/sites/default/files/styles/max_1000x1000/public/2024-05/db-976-blog-img-og.png?itok=qWBT8VA-&v=1717158571" alt="Predictive Optimization" width="600" height="500">

**Compaction** - This enhances query performance by optimizing file sizes, ensuring that data retrieval is efficient.

**Liquid Clustering** - This technique incrementally clusters incoming data, enabling optimal data layout and efficient data skipping.
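To make "data skipping" concrete, here is a toy model in plain Python. It is not how Databricks implements the feature: it only assumes that each file keeps the minimum and maximum of one column, and that a query reads a file only when that range could contain the requested value. The file names, statistics, and the `files_to_scan` helper are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    min_date: str  # ISO dates compare correctly as strings
    max_date: str


def files_to_scan(files, wanted_date):
    """Keep only files whose [min, max] range can contain wanted_date."""
    return [f.name for f in files if f.min_date <= wanted_date <= f.max_date]


# Well-clustered layout: each file covers a narrow, non-overlapping date range.
clustered = [
    FileStats("part-0", "2023-01-01", "2023-01-31"),
    FileStats("part-1", "2023-02-01", "2023-02-28"),
    FileStats("part-2", "2023-03-01", "2023-03-31"),
]

# Unclustered layout: every file spans the whole quarter, so nothing can be skipped.
unclustered = [
    FileStats("part-0", "2023-01-02", "2023-03-30"),
    FileStats("part-1", "2023-01-05", "2023-03-29"),
    FileStats("part-2", "2023-01-01", "2023-03-31"),
]

print(files_to_scan(clustered, "2023-02-13"))
print(files_to_scan(unclustered, "2023-02-13"))
```

With the clustered layout only one file has to be read; with the unclustered layout all three do, which is why data layout matters for query latency.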



# Databricks Medallion Pipeline for a Healthcare Payer


## Modeling Concepts

Databricks supports both **dimensional modeling** (Kimball/star schema) and **Inmon-style or Data Vault architectures (hubs, links, satellites)** on the Lakehouse platform. For dimensional models, you can build classic star and snowflake schemas directly with SQL, benefiting from ACID transactions and scalable Delta Lake tables.

For Inmon/Data Vault use cases, Databricks supports hub-and-satellite models that address core enterprise needs for history, auditability, and extensibility.

The Lakehouse approach lets you mix these styles as needed within a single platform, so you can incrementally land data in Raw Vault/EDW structures and later expose it as dimensional marts—all with Delta Live Tables, fine-grained security, and open formats.

Key blog resources:

[Implementing Dimensional Modeling](https://www.databricks.com/blog/implementing-dimensional-data-warehouse-databricks-sql-part-1)

[Implementing Data Vault/Hub-Satellite](https://www.databricks.com/blog/2022/06/24/prescriptive-guidance-for-implementing-a-data-vault-model-on-the-databricks-lakehouse-platform.html) 

[Data Vault Best Practices](https://www.databricks.com/blog/data-vault-best-practice-implementation-lakehouse)

<div style="display: flex; justify-content: space-between;">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/5c87faea-3e60-4f71-826d-42d04f6cdc0b.png" alt="Managed Tables" width="400" height="350">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/6826c275-d462-4c07-a978-43fe9c40f3ed.png" alt="Managed Tables" width="400" height="350">
</div>






## Sample Data Model

For a payer, commonly used tables include:

- **Members**: members enrolled in a health plan
- **Claims**: medical claim submissions
- **Providers**: healthcare providers (doctors, clinics)
- **Diagnoses**: claim diagnosis codes
- **Procedures**: procedures/services performed

The sample files used in this workshop are intentionally small (5 to 15 rows per table).

<img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/bdd54dc0-f3c7-4975-80a3-0017ebdb121c.png" alt="Sample data model" width="400" height="300">





## Key Columns by Table

| Table | Key Columns |
| --- | --- |
| **Members** | member_id, first_name, last_name, birth_date, gender, plan_id, effective_date |
| **Claims** | claim_id, member_id, provider_id, claim_date, total_charge, claim_status |
| **Providers** | provider_id, npi, provider_name, specialty, address, city, state |
| **Diagnoses** | claim_id, diagnosis_code, diagnosis_desc |
| **Procedures** | claim_id, procedure_code, procedure_desc, amount |

# SETUP

In [16]:
dbutils.widgets.text("catalog", "my_catalog", "Catalog")
dbutils.widgets.text("bronze_db", "payer_bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "payer_silver", "Silver DB")
dbutils.widgets.text("gold_db", "payer_gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

path = f"/Volumes/{catalog}/{bronze_db}/payer/files/"


In [17]:
print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")
print(f"Path: {path}")

Catalog: my_catalog
Bronze DB: payer_bronze
Silver DB: payer_silver
Gold DB: payer_gold
Path: /Volumes/my_catalog/payer_bronze/payer/files/


In [18]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

In [19]:
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

Create a **volume** with the folder layout below, then upload the shared files into the matching folders.

(schema) payer_bronze \
|--- payer (volume) \
|------ files \
|--------- claims \
|--------- diagnosis \
|--------- procedures \
|--------- members \
|--------- providers \
|------ downloads


In [20]:
spark.sql(f"CREATE VOLUME IF NOT EXISTS {bronze_db}.payer")

In [21]:
# Create the folders inside the volume (the volume itself was created in the previous cell)
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/claims")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/members")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/providers")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/downloads")

True

In [None]:
import requests
import zipfile
import io
import os
import shutil
# =============================================================================
# ❌ DO NOT USE - ORIGINAL CODE WITH PERMISSION ISSUES
# This code tries to run file operations on the driver (local machine) 
# which doesn't have access to Databricks /Volumes/ paths
# =============================================================================

# import requests
# import zipfile
# import io
# import os
# import shutil

# # Define the URL of the ZIP file
# url = "https://github.com/bigdatavik/databricksfirststeps/blob/6b225621c3c010a2734ab604efd79c15ec6c71b8/data/Payor_Archive.zip?raw=true"

# # Download the ZIP file
# response = requests.get(url)
# zip_file = zipfile.ZipFile(io.BytesIO(response.content))

# # Define the base path
# base_path = f"/Volumes/{catalog}/{bronze_db}/payer/downloads" 

# # ❌ This line causes "Operation not permitted" because it runs on driver
# # zip_file.extractall(base_path)

# # ❌ All subsequent file operations will also fail
# # Define the paths, create directories, move files, etc.

print("⚠️  Original code commented out - contains permission issues")
print("✅ Use Cell 21 instead for working UDF solution")



PermissionError: [Errno 13] Permission denied: '/Volumes/my_catalog'

In [None]:
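import os
import shutil

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Folder that holds the extracted archive (same location the download cell targets)
base_path = f"/Volumes/{catalog}/{bronze_db}/payer/downloads"

# Define UDF to copy files on Databricks compute
@udf(returnType=StringType())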
def copy_files_udf(base_path, catalog, bronze_db):
    """
    UDF to copy files to their final destinations on Databricks compute nodes.
    This runs on the executor nodes which have access to Databricks file system.
    """
    try:
        # Define file mappings: source_path -> destination_path
        file_mappings = {
            f"{base_path}/claims/claims.csv": f"/Volumes/{catalog}/{bronze_db}/payer/files/claims/claims.csv",
            f"{base_path}/diagnosis/diagnoses.csv": f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis/diagnosis.csv",
            f"{base_path}/procedures/procedures.csv": f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures/procedures.csv",
            f"{base_path}/members/member.csv": f"/Volumes/{catalog}/{bronze_db}/payer/files/members/members.csv",
            f"{base_path}/providers/providers.csv": f"/Volumes/{catalog}/{bronze_db}/payer/files/providers/providers.csv"
        }
        
        results = []
        for source, destination in file_mappings.items():
            if os.path.exists(source):
                # Ensure destination directory exists
                dest_dir = os.path.dirname(destination)
                os.makedirs(dest_dir, exist_ok=True)
                
                # Copy the file
                shutil.copy(source, destination)
                results.append(f"Copied to {destination}")
            else:
                results.append(f"Source file not found: {source}")
        
        return "\n".join(results)
        
    except Exception as e:
        return f"Error copying files: {str(e)}"

# Execute the file copying UDF
copy_df = spark.createDataFrame([(base_path, catalog, bronze_db)], ["base_path", "catalog", "bronze_db"])
copy_result_df = copy_df.select(copy_files_udf("base_path", "catalog", "bronze_db").alias("copy_result"))

# Display the results
copy_result = copy_result_df.collect()[0]["copy_result"]
print(copy_result)

FileNotFoundError: [Errno 2] No such file or directory: '/Volumes/my_catalog/payer_bronze/payer/downloads/claims/claims.csv'
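The rename mapping used inside the UDF (for example `member.csv` to `members.csv`) can be checked without a cluster. The sketch below applies the same copy-if-present logic to a temporary directory. The helper name `copy_mapped_files` and the temporary folder layout are assumptions for illustration, not part of the workshop code.

```python
import shutil
import tempfile
from pathlib import Path


def copy_mapped_files(mappings):
    """Copy each source to its destination, reporting missing sources instead of failing."""
    results = []
    for source, destination in mappings.items():
        src, dst = Path(source), Path(destination)
        if src.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dst)
            results.append(f"Copied to {dst}")
        else:
            results.append(f"Source file not found: {src}")
    return results


with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    files = Path(tmp) / "files"
    (downloads / "members").mkdir(parents=True)
    (downloads / "members" / "member.csv").write_text("member_id\n1001\n")

    results = copy_mapped_files({
        downloads / "members" / "member.csv": files / "members" / "members.csv",
        downloads / "claims" / "claims.csv": files / "claims" / "claims.csv",  # not created on purpose
    })
    copied_ok = (files / "members" / "members.csv").exists()

print(copied_ok)
for line in results:
    print(line)
```

A missing source produces a message rather than an exception, so one absent file does not stop the remaining copies.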

In [None]:
# =============================================================================
# 📁 FILE EXTRACTION USING MODULAR UTILITIES
# Clean, simple approach using external utility modules
# =============================================================================

# Import the file extraction utilities
import sys
import os

# Add the src directory to Python path for imports
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

# Import utility functions
from file_utils import extract_payer_data

# =============================================================================
# ✅ SIMPLE ONE-LINE FILE EXTRACTION
# =============================================================================

print("🚀 Extracting payer data files using modular utilities...")
print("=" * 60)

# Extract files with a single function call
extraction_results = extract_payer_data(
    spark=spark,
    catalog=catalog,
    bronze_db=bronze_db
)

# Display results
if extraction_results['success']:
    print("\n🎉 FILE EXTRACTION COMPLETED SUCCESSFULLY!")
    print(f"Method used: {extraction_results['method_used']}")
else:
    print("\n❌ FILE EXTRACTION FAILED!")
    print("Check the detailed results above for troubleshooting.")

print("\n📋 Summary:")
print(f"✅ Success: {extraction_results['success']}")
print(f"🔧 Method: {extraction_results['method_used']}")
print(f"📁 Directories: {len(extraction_results['directory_creation']['directories_created'])} created")

# Show any errors
if extraction_results['directory_creation']['errors']:
    print("\n⚠️ Directory creation warnings:")
    for error in extraction_results['directory_creation']['errors']:
        print(f"   {error}")

print("\n✅ Ready to proceed with the data pipeline!")


🚀 Starting file download and extraction using UDF...
(All file operations happen on Databricks compute with volume access)




✅ UDF Results:
Successfully extracted ZIP to temporary directory
Successfully copied claims.csv to /Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv
Successfully copied diagnoses.csv to /Volumes/my_catalog/payer_bronze/payer/files/diagnosis/diagnosis.csv
Successfully copied procedures.csv to /Volumes/my_catalog/payer_bronze/payer/files/procedures/procedures.csv
Successfully copied member.csv to /Volumes/my_catalog/payer_bronze/payer/files/members/members.csv
Successfully copied providers.csv to /Volumes/my_catalog/payer_bronze/payer/files/providers/providers.csv


In [None]:
# =============================================================================
# 🧪 TESTING & VALIDATION USING MODULAR UTILITIES  
# Simple testing approach using external utility modules
# =============================================================================

# Import testing utilities
from test_utils import run_all_tests, quick_health_check

print("🧪 Testing utilities loaded!")
print("Available functions:")
print("  • run_all_tests() - Comprehensive validation suite")
print("  • quick_health_check() - Fast verification using dbutils")
print("\n✅ Ready to test the extracted files!")


In [None]:
# =============================================================================
# 🚀 RUN COMPREHENSIVE TESTS ON EXTRACTED FILES
# Simple one-line testing using modular utilities
# =============================================================================

# Run comprehensive tests
test_results = run_all_tests(
    spark=spark,
    catalog=catalog,
    bronze_db=bronze_db
)

# Display summary
print(f"\n🎯 TESTING SUMMARY:")
print(f"Overall Success: {'✅ PASSED' if test_results['overall_success'] else '❌ FAILED'}")

if not test_results['overall_success']:
    print("\n💡 Troubleshooting tips:")
    print("• Re-run Cell 21 to re-extract files")
    print("• Check Cell 24 for debugging utilities")
    print("• Verify Unity Catalog permissions")
else:
    print("\n🎉 All tests passed! Files are ready for the data pipeline.")


In [None]:
# =============================================================================
# 🛠️ DEBUGGING & TROUBLESHOOTING USING MODULAR UTILITIES
# Simple debugging approach using external utility modules
# =============================================================================

# Import debugging utilities
from debug_utils import run_full_diagnostics

print("🛠️ Debugging utilities loaded!")
print("\nAvailable functions:")

print("\n🏥 Quick Health Check:")
print("  quick_health_check(spark, catalog, bronze_db)")

print("\n🔧 Full Diagnostics:")  
print("  run_full_diagnostics(spark, catalog, bronze_db)")

print("\n💡 Example usage:")
print("  # Quick check")
print("  quick_health_check(spark, catalog, bronze_db)")
print("  ")
print("  # Full diagnostics if issues found")
print("  diagnostics = run_full_diagnostics(spark, catalog, bronze_db)")

print("\n✅ Ready to debug any file extraction issues!")


# Let's Build Your First Data Pipeline!

In [0]:
# %sql
# -- Set the catalog and schema
# CREATE CATALOG IF NOT EXISTS my_catalog;
# USE CATALOG my_catalog;

# -- Create bronze schema
# CREATE SCHEMA IF NOT EXISTS payer_bronze;

# Bronze Layer – Ingest Raw Data

The bronze layer ingests raw files (CSV, JSON, Parquet) and lands them in Delta tables with minimal transformation.

**Example: Create Bronze Tables**

In [56]:
%sql
LIST '/Volumes/my_catalog/payer_bronze/payer/files/claims/'

path,name,size,modification_time
/Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv,claims.csv,395,1757301456000


## COPY INTO

[COPY INTO](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-copy-into) loads data from a file location into a Delta table. It is a retryable and idempotent operation: files in the source location that have already been loaded are skipped.

[Examples - COPY INTO](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/copy-into/)

[Tutorial - COPY INTO](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/copy-into/tutorial-notebook)
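The idempotent behavior can be pictured with a small plain-Python model. This is a conceptual sketch only, not how COPY INTO is implemented: it assumes the loader remembers file names it has already processed and that a `force` flag bypasses that memory. The class name `FileLoader` and its method are hypothetical.

```python
class FileLoader:
    """Toy model of an idempotent loader: remembers which files were already loaded."""

    def __init__(self):
        self.loaded_files = set()
        self.table = []

    def copy_into(self, files, force=False):
        """Append rows from files not seen before; force=True reloads everything."""
        inserted = 0
        for name, rows in files.items():
            if name in self.loaded_files and not force:
                continue  # already loaded, so skip it
            self.table.extend(rows)
            self.loaded_files.add(name)
            inserted += len(rows)
        return inserted


source = {"claims.csv": [{"claim_id": "CLM001"}, {"claim_id": "CLM002"}]}
loader = FileLoader()

first = loader.copy_into(source)               # loads both rows
second = loader.copy_into(source)              # skips the file, inserts nothing
forced = loader.copy_into(source, force=True)  # reloads and duplicates the rows
print(first, second, forced, len(loader.table))
```

The forced reload leaves duplicate rows in the table, which is why forcing a reload is best kept to demos.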

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.claims_raw;
COPY INTO payer_bronze.claims_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/claims/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'true');

-- NOTE: 'force = true' is used here for demo purposes only to reload all files every time. In production, omit this option so COPY INTO only processes new data files.

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
8,8,0


In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.diagnosis_raw;
COPY INTO payer_bronze.diagnosis_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/diagnosis/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
9,9,0


In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.members_raw;
COPY INTO payer_bronze.members_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/members/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
15,15,0


In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.procedures_raw;
COPY INTO payer_bronze.procedures_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/procedures/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
7,7,0


In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.providers_raw;
COPY INTO payer_bronze.providers_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/providers/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
5,5,0


# Silver Layer – Transform, Clean, and Join

The silver layer cleans data, applies business rules, type-casts fields, deduplicates, and joins tables when needed.

**Example: Transform Bronze to Silver**

In [0]:
%sql
-- Create silver schema
CREATE SCHEMA IF NOT EXISTS payer_silver;


-- Members: select relevant fields, cast types, remove duplicates
CREATE OR REPLACE TABLE payer_silver.members AS
SELECT
  DISTINCT CAST(member_id AS STRING) AS member_id,
  TRIM(first_name) AS first_name,
  TRIM(last_name) AS last_name,
  CAST(birth_date AS DATE) AS birth_date,
  gender,
  plan_id,
  CAST(effective_date AS DATE) AS effective_date
FROM payer_bronze.members_raw
WHERE member_id IS NOT NULL;


-- Claims: remove duplicates, prepare data
CREATE OR REPLACE TABLE payer_silver.claims AS
SELECT
  DISTINCT claim_id,
  member_id,
  provider_id,
  CAST(claim_date AS DATE) AS claim_date,
  ROUND(total_charge, 2) AS total_charge,
  LOWER(claim_status) AS claim_status
FROM payer_bronze.claims_raw
WHERE claim_id IS NOT NULL AND total_charge > 0;


-- Providers: deduplicate
CREATE OR REPLACE TABLE payer_silver.providers AS
SELECT
  DISTINCT provider_id,
  npi,
  provider_name,
  specialty,
  address,
  city,
  state
FROM payer_bronze.providers_raw
WHERE provider_id IS NOT NULL;


num_affected_rows,num_inserted_rows
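The claims rules in the SQL above (drop rows without a `claim_id` or with a non-positive charge, cast the date, round the charge, lowercase the status, remove exact duplicates) can be mirrored in plain Python to reason about them outside Databricks. The input rows and the function name `clean_claims` are made up for illustration.

```python
from datetime import date


def clean_claims(raw_rows):
    """Apply the silver rules for claims: filter, cast, round, lowercase, deduplicate."""
    cleaned = set()
    for row in raw_rows:
        charge = row["total_charge"]
        # Mirrors: WHERE claim_id IS NOT NULL AND total_charge > 0
        if row["claim_id"] is None or charge is None or charge <= 0:
            continue
        cleaned.add((
            row["claim_id"],
            row["member_id"],
            row["provider_id"],
            date.fromisoformat(row["claim_date"]),
            round(charge, 2),
            row["claim_status"].lower(),
        ))
    # A set removes exact duplicate rows, like SELECT DISTINCT over all columns.
    return sorted(cleaned)


raw = [
    {"claim_id": "CLM001", "member_id": 1001, "provider_id": 2001,
     "claim_date": "2023-01-10", "total_charge": 120.0, "claim_status": "PAID"},
    {"claim_id": "CLM001", "member_id": 1001, "provider_id": 2001,
     "claim_date": "2023-01-10", "total_charge": 120.0, "claim_status": "PAID"},  # exact duplicate
    {"claim_id": "CLM009", "member_id": 1009, "provider_id": 2001,
     "claim_date": "2023-01-15", "total_charge": 0.0, "claim_status": "paid"},  # zero charge
    {"claim_id": None, "member_id": 1010, "provider_id": 2002,
     "claim_date": "2023-01-16", "total_charge": 75.0, "claim_status": "paid"},  # missing key
    {"claim_id": "CLM004", "member_id": 1004, "provider_id": 2003,
     "claim_date": "2023-02-01", "total_charge": 180.5, "claim_status": "Pending"},
]

cleaned = clean_claims(raw)
for row in cleaned:
    print(row)
```

Note that `DISTINCT` across all selected columns only removes rows that are identical in every column; two rows sharing a `claim_id` but differing elsewhere would both survive.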



# Gold Layer – Aggregate, Model, Ready for Analytics

Gold tables are optimized for business usage: facts, dimensions, and aggregated views.

**Example: Build Analytics-Friendly Gold Tables**

In [0]:
%sql
-- Create gold schema
CREATE SCHEMA IF NOT EXISTS payer_gold;

-- Gold: Claims with member and provider details
CREATE OR REPLACE TABLE payer_gold.claims_enriched AS
SELECT
  c.claim_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  m.member_id,
  m.first_name,
  m.last_name,
  m.gender,
  m.plan_id,
  p.provider_id,
  p.provider_name,
  p.specialty,
  p.city,
  p.state
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id;


-- Gold: Claim Aggregates per Member
CREATE OR REPLACE TABLE payer_gold.member_claim_summary AS
SELECT
  member_id,
  COUNT(DISTINCT claim_id) AS total_claims,
  SUM(total_charge) AS sum_claims,
  MAX(total_charge) AS max_claim,
  MIN(total_charge) AS min_claim
FROM payer_silver.claims
GROUP BY member_id;


num_affected_rows,num_inserted_rows
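The per-member aggregate can also be sketched in plain Python to show what each output column means. The three claim rows below are hypothetical (in the workshop sample each member has a single claim), and `member_claim_summary` here is an illustrative function, not the Delta table of the same name.

```python
from collections import defaultdict

# Hypothetical silver claims: member M1 has two claims, M2 has one.
claims = [
    {"claim_id": "A1", "member_id": "M1", "total_charge": 120.0},
    {"claim_id": "A2", "member_id": "M1", "total_charge": 300.0},
    {"claim_id": "A3", "member_id": "M2", "total_charge": 180.5},
]


def member_claim_summary(rows):
    """Group claims by member and compute count, sum, max, and min of charges."""
    by_member = defaultdict(list)
    for row in rows:
        by_member[row["member_id"]].append(row)

    summary = {}
    for member_id, member_rows in by_member.items():
        charges = [r["total_charge"] for r in member_rows]
        summary[member_id] = {
            "total_claims": len({r["claim_id"] for r in member_rows}),  # COUNT(DISTINCT claim_id)
            "sum_claims": sum(charges),
            "max_claim": max(charges),
            "min_claim": min(charges),
        }
    return summary


summary = member_claim_summary(claims)
for member_id, stats in summary.items():
    print(member_id, stats)
```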


## Statistical Analysis
Let's try some statistical analysis using Python!

In [0]:
display(spark.table("payer_gold.claims_enriched"))

claim_id,claim_date,total_charge,claim_status,member_id,first_name,last_name,gender,plan_id,provider_id,provider_name,specialty,city,state
CLM006,2023-03-05,150.0,paid,1006,Robert,Stone,M,PLN101,2004,Dr. Davis,Cardiology,Louisville,KY
CLM003,2023-01-12,300.0,paid,1003,Paul,White,M,PLN101,2001,Dr. Adams,Family Practice,Louisville,KY
CLM002,2023-01-11,200.0,denied,1002,Jane,Smith,F,PLN102,2002,Dr. Baker,Internal Medicine,Louisville,KY
CLM004,2023-02-01,180.5,pending,1004,Emily,Jones,F,PLN102,2003,Dr. Clark,Pediatrics,Lexington,KY
CLM001,2023-01-10,120.0,paid,1001,John,Doe,M,PLN101,2001,Dr. Adams,Family Practice,Louisville,KY
CLM008,2023-03-15,189.0,paid,1008,David,Ryan,M,PLN104,2003,Dr. Clark,Pediatrics,Lexington,KY
CLM007,2023-03-10,176.0,denied,1007,Olivia,Simms,F,PLN102,2004,Dr. Davis,Cardiology,Louisville,KY
CLM005,2023-02-13,240.75,paid,1005,Alice,Brown,F,PLN103,2002,Dr. Baker,Internal Medicine,Louisville,KY


**Let's ask Databricks AI Assistant!** \
Example prompts:
* What kind of aggregations can I do with table "payer_gold.claims_enriched"?
* What does "spark.table" command do?

In [0]:
display(spark.table("payer_gold.claims_enriched").groupBy("claim_status").sum("total_charge"))

claim_status,sum(total_charge)
pending,180.5
paid,999.75
denied,376.0


Databricks visualization. Run in Databricks to view.

In [0]:
display(spark.table("payer_gold.claims_enriched").groupBy("gender").count())

gender,count
M,4
F,4


Databricks visualization. Run in Databricks to view.

In [0]:
display(spark.table("payer_gold.claims_enriched").groupBy("claim_date").sum("total_charge").orderBy("claim_date"))

claim_date,sum(total_charge)
2023-01-10,120.0
2023-01-11,200.0
2023-01-12,300.0
2023-02-01,180.5
2023-02-13,240.75
2023-03-05,150.0
2023-03-10,176.0
2023-03-15,189.0


Databricks visualization. Run in Databricks to view.

In [0]:
display(spark.table("payer_gold.claims_enriched").groupBy("city").count())

city,count
Louisville,6
Lexington,2


Databricks visualization. Run in Databricks to view.


In [0]:
display(spark.table("payer_gold.claims_enriched").select("total_charge"))

total_charge
150.0
300.0
200.0
180.5
120.0
189.0
176.0
240.75


Databricks visualization. Run in Databricks to view.

In [0]:
display(spark.table("payer_gold.claims_enriched").select("claim_date", "total_charge"))

claim_date,total_charge
2023-03-05,150.0
2023-01-12,300.0
2023-01-11,200.0
2023-02-01,180.5
2023-01-10,120.0
2023-03-15,189.0
2023-03-10,176.0
2023-02-13,240.75


Databricks visualization. Run in Databricks to view.


# AI/BI

Intelligent analytics for everyone!

Databricks AI/BI is a new type of business intelligence product designed to provide a deep understanding of your data's semantics, enabling self-service data analysis for everyone in your organization. AI/BI is built on a compound AI system that draws insights from the full lifecycle of your data across the Databricks platform, including ETL pipelines, lineage, and other queries.

<img src="https://www.databricks.com/sites/default/files/2025-05/hero-image-ai-bi-v2-2x.png?v=1748417271" alt="Databricks AI/BI" width="600" height="500">

# Genie

Talk with your data

Now everyone can get insights from data simply by asking questions in natural language.

<img src="https://www.databricks.com/sites/default/files/2025-06/ai-bi-genie-hero.png?v=1749162682" alt="AI/BI Genie" width="600" height="500">
