# Healthcare Payer Analytics for SAS/ASA Professionals
**Comprehensive Training Notebook - From SAS to Databricks**

This notebook demonstrates advanced healthcare payer analytics using Databricks, specifically designed for analysts transitioning from SAS/ASA environments. It covers the complete medallion architecture (Bronze → Silver → Gold) with healthcare-specific business intelligence patterns.

**🎯 Learning Objectives:**
- Master medallion data architecture for healthcare data
- Learn SAS procedure equivalents in Databricks SQL
- Implement healthcare KPIs and risk scoring models
- Build executive dashboards and clinical analytics
- Understand modern AI/BI capabilities


## Setup Parameters


In [1]:
# Setup catalog and database parameters
dbutils.widgets.text("catalog", "my_catalog", "Catalog")
dbutils.widgets.text("bronze_db", "payer_bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "payer_silver", "Silver DB")
dbutils.widgets.text("gold_db", "payer_gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")


Catalog: my_catalog
Bronze DB: payer_bronze
Silver DB: payer_silver
Gold DB: payer_gold
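
Because the widget values are later interpolated into SQL strings with f-strings, it can help to validate them first. The following is a minimal sketch under that assumption; `safe_identifier` and `qualified_name` are hypothetical helpers, not part of this notebook or of any Databricks API.

```python
import re

# Hypothetical helper (not part of the notebook): accept only simple
# identifiers made of letters, digits, and underscores
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def safe_identifier(name: str) -> str:
    """Return `name` wrapped in backticks, or raise if it is not a plain identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return f"`{name}`"

def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level name from validated parts."""
    return ".".join(safe_identifier(part) for part in (catalog, schema, table))

print(qualified_name("my_catalog", "payer_bronze", "claims_raw"))
```

A validated name could then be used wherever the notebook builds SQL text from `catalog`, `bronze_db`, `silver_db`, or `gold_db`.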


## Initialize Catalogs and Schemas


In [6]:
# Create catalog and databases
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

print("✅ Catalog and databases initialized")


✅ Catalog and databases initialized


## Step 1: Extract Data Files
**One-line file extraction using modular utilities**


In [7]:
# Import and execute file extraction with a single function call
import sys
import os

# Add src directory to Python path
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

# Import and execute the file extraction utility
from file_utils import extract_payer_data

extraction_results = extract_payer_data(
    spark=spark,
    catalog=catalog,
    bronze_db=bronze_db
)

# Simple success check
if extraction_results['success']:
    print("🎉 File extraction completed successfully!")
else:
    print("❌ File extraction failed. Check logs above.")


🚀 Starting payer data extraction...
📁 Creating volume directories...
✅ Created 6 directories

📦 Extracting files using UDF...


UDF Results:
Successfully extracted ZIP to temporary directory
Successfully copied claims.csv to /Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv
Successfully copied diagnoses.csv to /Volumes/my_catalog/payer_bronze/payer/files/diagnosis/diagnosis.csv
Successfully copied procedures.csv to /Volumes/my_catalog/payer_bronze/payer/files/procedures/procedures.csv
Successfully copied member.csv to /Volumes/my_catalog/payer_bronze/payer/files/members/members.csv
Successfully copied providers.csv to /Volumes/my_catalog/payer_bronze/payer/files/providers/providers.csv
🎉 File extraction completed successfully!


## Step 2: Bronze Layer - Raw Data Ingestion
**Create bronze tables from extracted files**


In [8]:
print("Creating bronze tables using DBSQL COPY INTO...")


Creating bronze tables using DBSQL COPY INTO...


In [9]:
%sql
-- Create Claims Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.claims_raw;
COPY INTO payer_bronze.claims_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


Unnamed: 0,num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
0,0,0,0


In [10]:
%sql
-- Create Members Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.members_raw;
COPY INTO payer_bronze.members_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/members/members.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


Unnamed: 0,num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
0,0,0,0


In [13]:
%sql
-- Create Providers Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.providers_raw;
COPY INTO payer_bronze.providers_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/providers/providers.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


Unnamed: 0,num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
0,0,0,0


In [14]:
%sql
-- Create Diagnosis Bronze Table  
CREATE TABLE IF NOT EXISTS payer_bronze.diagnosis_raw;
COPY INTO payer_bronze.diagnosis_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/diagnosis/diagnosis.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


Unnamed: 0,num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
0,0,0,0


In [15]:
%sql
-- Create Procedures Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.procedures_raw;
COPY INTO payer_bronze.procedures_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/procedures/procedures.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


KeyboardInterrupt: 
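
The five bronze cells above repeat the same `CREATE TABLE` / `COPY INTO` pattern with only the table name and file path changing. The sketch below generates those statements from a mapping; `BRONZE_SOURCES` and `copy_into_statements` are hypothetical names introduced for illustration, and the paths are the ones shown in the extraction log. The zero row counts in the outputs are consistent with a rerun, since `COPY INTO` skips files it has already loaded.

```python
# Source files per bronze table, taken from the extraction log above
BRONZE_SOURCES = {
    "claims_raw": "claims/claims.csv",
    "members_raw": "members/members.csv",
    "providers_raw": "providers/providers.csv",
    "diagnosis_raw": "diagnosis/diagnosis.csv",
    "procedures_raw": "procedures/procedures.csv",
}

def copy_into_statements(catalog, bronze_db, table, relative_path):
    """Return the two SQL statements used for one bronze table."""
    source = f"/Volumes/{catalog}/{bronze_db}/payer/files/{relative_path}"
    create = f"CREATE TABLE IF NOT EXISTS {bronze_db}.{table}"
    copy = (
        f"COPY INTO {bronze_db}.{table} FROM '{source}'\n"
        "FILEFORMAT = CSV\n"
        "FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')\n"
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )
    return [create, copy]

statements = [
    stmt
    for table, path in BRONZE_SOURCES.items()
    for stmt in copy_into_statements("my_catalog", "payer_bronze", table, path)
]
print(len(statements))

# In the notebook, each statement could then be run with spark.sql(stmt)
```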

## Step 3: Silver Layer - Cleaned and Conformed Data
**Transform bronze data into clean, analytics-ready tables**


In [None]:
%sql
-- Clean and transform members data
CREATE OR REPLACE TABLE payer_silver.members AS
SELECT
  DISTINCT CAST(member_id AS STRING) AS member_id,
  TRIM(first_name) AS first_name,
  TRIM(last_name) AS last_name,
  CAST(birth_date AS DATE) AS birth_date,
  gender,
  plan_id,
  CAST(effective_date AS DATE) AS effective_date
FROM payer_bronze.members_raw
WHERE member_id IS NOT NULL


Unnamed: 0,num_affected_rows,num_inserted_rows


In [16]:
%sql
-- Clean and transform claims data
CREATE OR REPLACE TABLE payer_silver.claims AS
SELECT
  DISTINCT claim_id,
  member_id,
  provider_id,
  CAST(claim_date AS DATE) AS claim_date,
  ROUND(total_charge, 2) AS total_charge,
  LOWER(claim_status) AS claim_status
FROM payer_bronze.claims_raw
WHERE claim_id IS NOT NULL AND total_charge > 0


Unnamed: 0,num_affected_rows,num_inserted_rows


In [17]:
%sql
-- Clean and transform providers data
CREATE OR REPLACE TABLE payer_silver.providers AS
SELECT
  DISTINCT provider_id,
  npi,
  provider_name,
  specialty,
  address,
  city,
  state
FROM payer_bronze.providers_raw
WHERE provider_id IS NOT NULL


Unnamed: 0,num_affected_rows,num_inserted_rows


## Step 4: Gold Layer - Business-Ready Analytics Tables
**Create enriched, aggregated tables for analytics and reporting**


In [18]:
%sql
-- Create enriched claims table with member and provider details
CREATE OR REPLACE TABLE payer_gold.claims_enriched AS
SELECT
  c.claim_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  m.member_id,
  m.first_name,
  m.last_name,
  m.gender,
  m.plan_id,
  p.provider_id,
  p.provider_name,
  p.specialty,
  p.city,
  p.state
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id


Unnamed: 0,num_affected_rows,num_inserted_rows


In [19]:
%sql
-- Create member summary table for analytics
CREATE OR REPLACE TABLE payer_gold.member_claim_summary AS
SELECT
  member_id,
  COUNT(DISTINCT claim_id) AS total_claims,
  SUM(total_charge) AS sum_claims,
  AVG(total_charge) AS avg_claim_amount,
  MAX(total_charge) AS max_claim,
  MIN(total_charge) AS min_claim
FROM payer_silver.claims
GROUP BY member_id


Unnamed: 0,num_affected_rows,num_inserted_rows


## Step 5: Verify Pipeline Results
**Quick validation of the medallion pipeline**


In [20]:
# Display pipeline summary
bronze_count = spark.sql(f"SELECT COUNT(*) as count FROM {bronze_db}.claims_raw").collect()[0]['count']
silver_count = spark.sql(f"SELECT COUNT(*) as count FROM {silver_db}.claims").collect()[0]['count']
gold_count = spark.sql(f"SELECT COUNT(*) as count FROM {gold_db}.claims_enriched").collect()[0]['count']

print("📊 MEDALLION PIPELINE SUMMARY")
print("=" * 40)
print(f"🥉 Bronze Claims: {bronze_count:,} records")
print(f"🥈 Silver Claims: {silver_count:,} records")
print(f"🥇 Gold Claims: {gold_count:,} records")
print("\n🎉 Medallion pipeline completed successfully!")


📊 MEDALLION PIPELINE SUMMARY
🥉 Bronze Claims: 8 records
🥈 Silver Claims: 8 records
🥇 Gold Claims: 8 records

🎉 Medallion pipeline completed successfully!
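
Matching counts are expected here, but the silver filter (`total_charge > 0`) and the gold inner joins can drop rows silently on other data. A small retention check makes such losses visible. This is a sketch only; `layer_retention` is a hypothetical helper and the counts are copied from the summary above.

```python
def layer_retention(counts):
    """Given ordered (layer, row_count) pairs, return the share of rows kept
    from each layer to the next, as percentages."""
    report = []
    for (prev_name, prev_n), (name, n) in zip(counts, counts[1:]):
        # Guard against an empty upstream layer
        pct = round(n * 100.0 / prev_n, 2) if prev_n else None
        report.append((f"{prev_name} -> {name}", pct))
    return report

# Counts from the pipeline summary above
counts = [("bronze", 8), ("silver", 8), ("gold", 8)]
for step, pct in layer_retention(counts):
    print(f"{step}: {pct}% of rows retained")
```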


## Step 6: Data Quality Assessment & Profiling
**Comprehensive data quality checks familiar to SAS analysts**


In [None]:
# Data Quality Dashboard - Similar to SAS PROC FREQ and PROC MEANS
# Note: no `from pyspark.sql.functions import *` here. A star import would shadow
# Python's built-in round(), which this report calls on plain floats.

def data_quality_report(df, table_name):
    """Comprehensive data quality report similar to SAS procedures"""
    try:
        print(f"\n📊 DATA QUALITY REPORT: {table_name}")
        print("=" * 60)
        
        # Basic statistics
        print("🔄 Calculating row count...")
        row_count = df.count()
        col_count = len(df.columns)
        print(f"📈 Rows: {row_count:,} | Columns: {col_count}")
        
        # Schema info
        print(f"\n📋 SCHEMA INFORMATION:")
        df.printSchema()
        
        # Use a simpler approach with SQL for missing values analysis
        print(f"\n🔍 MISSING VALUES ANALYSIS:")
        print("🔄 Using SQL approach for better compatibility...")
        
        # Create a temporary view for SQL analysis
        temp_view_name = f"temp_{table_name}_dq"
        df.createOrReplaceTempView(temp_view_name)
        
        # Build SQL query to count nulls for all columns
        null_check_queries = []
        for column_name in df.columns:
            # Escape column names and aliases that might be SQL keywords
            escaped_col = f"`{column_name}`"
            null_check_queries.append(f"SUM(CASE WHEN {escaped_col} IS NULL THEN 1 ELSE 0 END) as `{column_name}_nulls`")
        
        sql_query = f"""
        SELECT 
            {', '.join(null_check_queries)}
        FROM {temp_view_name}
        """
        
        print("🔄 Executing null count query...")
        null_counts = spark.sql(sql_query).collect()[0]
        
        # Format results
        print("\n📊 MISSING VALUES SUMMARY:")
        print("-" * 50)
        print(f"{'Column':<20} {'Missing':<10} {'Percent':<10}")
        print("-" * 50)
        
        missing_stats = []
        for column_name in df.columns:
            null_count = null_counts[f"{column_name}_nulls"]
            null_pct = (null_count / row_count * 100) if row_count > 0 else 0
            print(f"{column_name:<20} {null_count:<10} {null_pct:.2f}%")
            missing_stats.append((column_name, null_count, round(null_pct, 2)))
        
        print("-" * 50)
        
        # Clean up temp view
        spark.sql(f"DROP VIEW IF EXISTS {temp_view_name}")
        
        return missing_stats
        
    except Exception as func_error:
        print(f"❌ Error in data_quality_report function: {func_error}")
        print(f"📋 Error type: {type(func_error).__name__}")
        import traceback
        print(f"📋 Full traceback: {traceback.format_exc()}")
        return None

# Run data quality report on our gold table
# Check if table exists and debug any issues
try:
    print("🔍 Checking if table exists...")
    
    # First, let's check the catalog and database
    current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
    current_database = spark.sql("SELECT current_database()").collect()[0][0]
    print(f"📂 Current catalog: {current_catalog}")
    print(f"📂 Current database: {current_database}")
    
    # Try to access the table
    print("📊 Attempting to access payer_gold.claims_enriched...")
    claims_df = spark.table("payer_gold.claims_enriched")
    print("✅ Table access successful!")
    
    # Run the data quality report
    print("🔬 Running data quality analysis...")
    dq_report = data_quality_report(claims_df, "claims_enriched")
    
except Exception as e:
    print(f"❌ Error details: {str(e)}")
    print(f"📋 Error type: {type(e).__name__}")
    
    # Let's try to list available tables to debug
    try:
        print("\n🔍 Available tables in payer_gold:")
        tables = spark.sql("SHOW TABLES IN payer_gold").collect()
        for table in tables:
            print(f"   - {table.tableName}")
    except Exception as list_error:
        print(f"   Could not list tables: {list_error}")
    
    print("\n💡 Debug steps:")
    print("   1. Verify the catalog is set correctly")
    print("   2. Check if payer_gold database exists")
    print("   3. Ensure claims_enriched table was created successfully")


🔍 Checking if table exists...


📂 Current catalog: my_catalog
📂 Current database: default
📊 Attempting to access payer_gold.claims_enriched...
✅ Table access successful!
🔬 Running data quality analysis...

📊 DATA QUALITY REPORT: claims_enriched
🔄 Calculating row count...


📈 Rows: 8 | Columns: 14

📋 SCHEMA INFORMATION:
root
 |-- claim_id: string (nullable = true)
 |-- claim_date: date (nullable = true)
 |-- total_charge: double (nullable = true)
 |-- claim_status: string (nullable = true)
 |-- member_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- plan_id: string (nullable = true)
 |-- provider_id: integer (nullable = true)
 |-- provider_name: string (nullable = true)
 |-- specialty: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)


🔍 MISSING VALUES ANALYSIS:
🔄 Analyzing missing values for each column...
   Processing column 1/14: claim_id
   ⚠️  Error processing column claim_id:
   Processing column 2/14: claim_date
   ⚠️  Error processing column claim_date:
   Processing column 3/14: total_charge
   ⚠️  Error processing column total_charge:
   Processing column 4/14: claim_status
   ⚠️  Error processing column claim_status:
   Processing column 5/14: member_id
   ⚠️  Error processing column member_id:
   Processing column 6/14: first_name
   ⚠️  Error processing column first_name:
   Processing column 7/14: last_name
   ⚠️  Error processing column last_name:
   Processing column 8/14: gender
   ⚠️  Error processing column gender:
   Processing column 9/14: plan_id
   ⚠️  Error processing column plan_id:
   Processing column 10/14: provider_id
   ⚠️  Error processing column provider_id:
   Processing column 11/14: provider_name
   ⚠️  Error processing column provider_name:
   Processing column 12/14: specialty
   ⚠️  Error processing column specialty:
   Processing column 13/14: city
   ⚠️  Error processing column city:
   Processing column 14/14: state
   ⚠️  Error processing column state:
🔄 Creating missing values summary...
🔄 Displaying results...

+-------------+-------------+---------------+
|Column       |Missing_Count|Missing_Percent|
+-------------+-------------+---------------+
|claim_date   |-1           |-1             |
|total_charge |-1           |-1             |
|first_name   |-1           |-1             |
|last_name    |-1           |-1             |
|plan_id      |-1           |-1             |
|provider_id  |-1           |-1             |
|city         |-1           |-1             |
|state        |-1           |-1             |
|claim_id     |-1           |-1             |
|claim_status |-1           |-1             |
|member_id    |-1           |-1             |
|gender       |-1           |-1             |
|provider_name|-1           |-1             |
|specialty    |-1           |-1             |
+-------------+-------------+---------------+



### Numerical Data Profiling (Similar to SAS PROC MEANS)


In [24]:
%sql
-- Comprehensive statistical summary for numerical variables (like SAS PROC MEANS)
SELECT 

  'total_charge' as variable,
  COUNT(*) as n,
  COUNT(total_charge) as n_non_missing,
  ROUND(AVG(total_charge), 2) as mean,
  ROUND(STDDEV(total_charge), 2) as std_dev,
  MIN(total_charge) as minimum,
  ROUND(PERCENTILE_APPROX(total_charge, 0.25), 2) as q1,
  ROUND(PERCENTILE_APPROX(total_charge, 0.50), 2) as median,
  ROUND(PERCENTILE_APPROX(total_charge, 0.75), 2) as q3,
  MAX(total_charge) as maximum,
  ROUND((PERCENTILE_APPROX(total_charge, 0.75) - PERCENTILE_APPROX(total_charge, 0.25)), 2) as iqr
FROM payer_gold.claims_enriched


Unnamed: 0,variable,n,n_non_missing,mean,std_dev,minimum,q1,median,q3,maximum,iqr
0,total_charge,8,8,194.53,55.29,120.0,150.0,180.5,200.0,300.0,50.0
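
As a cross-check on the summary above, the same statistics can be recomputed in plain Python from the eight `total_charge` values that appear in this notebook's outputs. This is a sketch, not part of the pipeline. It also illustrates one difference worth knowing: `STDDEV` in Spark SQL is the sample standard deviation, and `PERCENTILE_APPROX` returns a value that actually occurs in the column (180.5 here), whereas an exact median averages the two middle values. Spark SQL's `PERCENTILE` function is an option when an exact percentile is needed.

```python
import statistics

# total_charge values as they appear in this notebook's cell outputs
charges = [120.0, 180.5, 200.0, 176.0, 300.0, 240.75, 189.0, 150.0]

summary = {
    "n": len(charges),
    "mean": round(statistics.mean(charges), 2),
    "std_dev": round(statistics.stdev(charges), 2),  # sample standard deviation (n - 1)
    "minimum": min(charges),
    "exact_median": statistics.median(charges),      # averages the two middle values
    "maximum": max(charges),
}
print(summary)
```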


## Step 7: Advanced Healthcare Analytics
**Business intelligence patterns familiar to payer analysts**

### SAS-Style Cross Tabulation Analysis (PROC FREQ equivalent)


In [22]:
# Cross-tabulation (like SAS PROC FREQ)
from pyspark.sql import functions as F  # namespaced import avoids shadowing built-ins such as sum()

def cross_tab_analysis(df, row_var, col_var, value_var=None):
    """Create cross-tabulation similar to SAS PROC FREQ"""
    if value_var:
        # With a numeric measure: frequency plus sum and mean per cell
        crosstab = df.groupBy(row_var, col_var).agg(
            F.count("*").alias("frequency"),
            F.sum(value_var).alias("sum_value"),
            F.avg(value_var).alias("avg_value")
        )
    else:
        # Simple frequency count; rename the default "count" column
        # (DataFrame.alias() would rename the DataFrame, not the column)
        crosstab = df.groupBy(row_var, col_var).count().withColumnRenamed("count", "frequency")
    
    return crosstab

# Claims by Status and Gender (similar to SAS: PROC FREQ DATA=claims; TABLES gender*claim_status;)
crosstab_result = cross_tab_analysis(claims_df, "gender", "claim_status", "total_charge")
display(crosstab_result.orderBy("gender", "claim_status"))


Unnamed: 0,gender,claim_status,frequency,sum_value,avg_value
0,F,denied,2,376.0,188.0
1,F,paid,1,240.75,240.75
2,F,pending,1,180.5,180.5
3,M,paid,4,759.0,189.75


### Healthcare Utilization Analysis
**Key metrics for payer operations**


In [None]:
%sql
-- Member utilization patterns (Per Member Per Month - PMPM calculations)
-- Note: active_months counts calendar months with at least one claim, not enrolled months
WITH member_utilization AS (
  SELECT 
    member_id,
    gender,
    plan_id,
    COUNT(DISTINCT claim_id) as total_claims,
    COUNT(DISTINCT DATE_TRUNC('MONTH', claim_date)) as active_months,
    SUM(total_charge) as total_cost,
    AVG(total_charge) as avg_claim_cost,
    MAX(total_charge) as max_claim_cost,
    COUNT(DISTINCT provider_id) as unique_providers,
    COUNT(DISTINCT specialty) as unique_specialties
  FROM payer_gold.claims_enriched
  GROUP BY member_id, gender, plan_id
),
utilization_metrics AS (
  SELECT 
    *,
    CASE 
      WHEN active_months > 0 THEN ROUND(total_cost / active_months, 2)
      ELSE 0 
    END as pmpm_cost,
    CASE 
      WHEN active_months > 0 THEN ROUND(total_claims / active_months, 2)
      ELSE 0 
    END as pmpm_claims,
    CASE 
      WHEN total_claims > 5 THEN 'High Utilizer'
      WHEN total_claims > 2 THEN 'Medium Utilizer'
      ELSE 'Low Utilizer'
    END as utilization_category
  FROM member_utilization
)
SELECT * FROM utilization_metrics ORDER BY total_cost DESC


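
PMPM is conventionally total cost divided by member months of enrollment, while the query above divides by the months in which a member had a claim. The sketch below contrasts the two denominators. The `pmpm` helper and all numbers are illustrative assumptions, not values from this dataset.

```python
def pmpm(total_cost, member_months):
    """Per Member Per Month cost: total cost divided by member months."""
    if member_months <= 0:
        return None  # undefined without exposure
    return round(total_cost / member_months, 2)

# Illustrative numbers only (not from the notebook's data)
cost = 1200.0
enrolled_months = 12   # member enrolled for the whole year
active_months = 3      # months in which the member had a claim

print(pmpm(cost, enrolled_months))  # cost spread over enrollment
print(pmpm(cost, active_months))    # what dividing by claim months yields
```

Dividing by claim months overstates PMPM for members who are enrolled but rarely use services, so an enrollment-based denominator is preferable when eligibility spans are available.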

### Risk Scoring & Segmentation
**Predictive analytics for member risk assessment**


In [None]:
%sql
-- Risk scoring based on multiple factors (similar to SAS scoring models)
CREATE OR REPLACE TABLE payer_gold.member_risk_scores AS
WITH member_metrics AS (
  SELECT 
    member_id,
    gender,
    plan_id,
    COUNT(DISTINCT claim_id) as claim_count,
    SUM(total_charge) as total_cost,
    AVG(total_charge) as avg_claim_cost,
    COUNT(DISTINCT provider_id) as provider_count,
    COUNT(DISTINCT specialty) as specialty_count,
    COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) as denied_claims,  -- status is lowercased in silver
    DATEDIFF(CURRENT_DATE(), MAX(claim_date)) as days_since_last_claim
  FROM payer_gold.claims_enriched
  GROUP BY member_id, gender, plan_id
),
risk_calculations AS (
  SELECT 
    *,
    -- Risk factors with weights (customize based on actuarial analysis)
    (claim_count * 0.3) + 
    (LEAST(total_cost / 1000, 10) * 0.4) + 
    (provider_count * 0.2) + 
    (denied_claims * 0.1) as raw_risk_score
  FROM member_metrics
),
risk_segments AS (
  SELECT 
    *,
    ROUND(raw_risk_score, 2) as risk_score,
    NTILE(5) OVER (ORDER BY raw_risk_score) as risk_quintile,
    CASE 
      WHEN raw_risk_score > 8 THEN 'Very High Risk'
      WHEN raw_risk_score > 6 THEN 'High Risk'
      WHEN raw_risk_score > 4 THEN 'Medium Risk'
      WHEN raw_risk_score > 2 THEN 'Low Risk'
      ELSE 'Very Low Risk'
    END as risk_category
  FROM risk_calculations
)
SELECT * FROM risk_segments ORDER BY risk_score DESC;

SELECT * FROM payer_gold.member_risk_scores LIMIT 10


Unnamed: 0,claim_date,sum(total_charge)
0,2023-01-10,120.0
1,2023-01-11,200.0
2,2023-01-12,300.0
3,2023-02-01,180.5
4,2023-02-13,240.75
5,2023-03-05,150.0
6,2023-03-10,176.0
7,2023-03-15,189.0
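
To make the scoring rule easier to reason about, the weighted sum and the category thresholds from the SQL above can be mirrored in a few lines of Python. This is a sketch for checking individual cases by hand. The function names and the example member are hypothetical, and the weights are the illustrative ones from the query, not actuarially derived values.

```python
def raw_risk_score(claim_count, total_cost, provider_count, denied_claims):
    """Weighted sum mirroring the SQL expression; cost is capped at 10 units of 1,000."""
    return (
        claim_count * 0.3
        + min(total_cost / 1000, 10) * 0.4
        + provider_count * 0.2
        + denied_claims * 0.1
    )

def risk_category(score):
    """Map a raw score to the same five labels as the SQL CASE expression."""
    if score > 8:
        return "Very High Risk"
    if score > 6:
        return "High Risk"
    if score > 4:
        return "Medium Risk"
    if score > 2:
        return "Low Risk"
    return "Very Low Risk"

# Hypothetical member: 12 claims, 25,000 total cost, 4 providers, 2 denials
score = raw_risk_score(12, 25_000, 4, 2)
print(round(score, 2), risk_category(score))
```

Note that claim and provider counts are uncapped, so they can dominate the score for heavy utilizers while the cost term stops growing at 10,000.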


### Executive Dashboard KPIs
**Key performance indicators for payer operations**


In [None]:
%sql
-- Executive KPI Dashboard (similar to SAS Enterprise Guide dashboards)
CREATE OR REPLACE TABLE payer_gold.executive_kpis AS
WITH kpi_calculations AS (
  SELECT 
    COUNT(DISTINCT member_id) as total_members,
    COUNT(DISTINCT claim_id) as total_claims,
    COUNT(DISTINCT provider_id) as total_providers,
    SUM(total_charge) as total_medical_costs,
    AVG(total_charge) as avg_claim_cost,
    
    -- Claims processing metrics (claim_status is lowercased in the silver layer)
    COUNT(CASE WHEN claim_status = 'paid' THEN 1 END) as paid_claims,
    COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) as denied_claims,
    COUNT(CASE WHEN claim_status = 'pending' THEN 1 END) as pending_claims,
    
    -- Financial metrics
    SUM(CASE WHEN claim_status = 'paid' THEN total_charge ELSE 0 END) as paid_amount,
    SUM(CASE WHEN claim_status = 'denied' THEN total_charge ELSE 0 END) as denied_amount,
    
    -- Utilization by gender
    COUNT(CASE WHEN gender = 'M' THEN 1 END) as male_claims,
    COUNT(CASE WHEN gender = 'F' THEN 1 END) as female_claims,
    
    -- Claim counts for selected specialties
    COUNT(CASE WHEN specialty = 'Cardiology' THEN 1 END) as cardiology_claims,
    COUNT(CASE WHEN specialty = 'Family Practice' THEN 1 END) as family_practice_claims
    
  FROM payer_gold.claims_enriched
),
calculated_kpis AS (
  SELECT 
    *,
    ROUND((paid_claims * 100.0 / total_claims), 2) as paid_claims_rate,
    ROUND((denied_claims * 100.0 / total_claims), 2) as denial_rate,
    ROUND((pending_claims * 100.0 / total_claims), 2) as pending_rate,
    ROUND((total_medical_costs / total_members), 2) as cost_per_member,
    ROUND((total_claims * 1.0 / total_members), 2) as claims_per_member
  FROM kpi_calculations
)
SELECT * FROM calculated_kpis;

SELECT * FROM payer_gold.executive_kpis


Unnamed: 0,city,count
0,Louisville,6
1,Lexington,2
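
The rate KPIs in the query above are simple ratios of status counts to total claims. The sketch below reproduces that arithmetic using the status counts implied by the cross-tabulation output earlier in the notebook (5 paid, 2 denied, 1 pending). The `rate` helper is hypothetical. Keys are lowercase because the silver layer lowercases `claim_status`.

```python
def rate(numerator, denominator):
    """Percentage rounded to 2 decimals, or None when the denominator is zero."""
    if not denominator:
        return None
    return round(numerator * 100.0 / denominator, 2)

# Status counts summed from the cross-tabulation output earlier in the notebook
status_counts = {"paid": 5, "denied": 2, "pending": 1}
total_claims = sum(status_counts.values())

kpis = {f"{status}_rate": rate(n, total_claims) for status, n in status_counts.items()}
print(kpis)
```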


## Step 8: SAS to Databricks Translation Guide
**Common SAS procedures and their Databricks equivalents**


In [None]:
# SAS to Databricks Translation Examples for Healthcare Analytics

def show_sas_translation():
    """Show common SAS procedures and their Databricks equivalents"""
    
    print("🔄 SAS TO DATABRICKS TRANSLATION GUIDE")
    print("=" * 60)
    
    translations = [
        {
            "sas_proc": "PROC FREQ",
            "sas_code": "PROC FREQ DATA=claims; TABLES gender*claim_status;",
            "databricks_sql": """
SELECT gender, claim_status, COUNT(*) as frequency
FROM payer_gold.claims_enriched 
GROUP BY gender, claim_status
ORDER BY gender, claim_status""",
            "description": "Cross-tabulation frequency analysis"
        },
        {
            "sas_proc": "PROC MEANS",
            "sas_code": "PROC MEANS DATA=claims N MEAN STD MIN MAX; VAR total_charge;",
            "databricks_sql": """
SELECT 
  COUNT(total_charge) as N,  -- PROC MEANS N counts non-missing values
  ROUND(AVG(total_charge), 2) as MEAN,
  ROUND(STDDEV(total_charge), 2) as STD,
  MIN(total_charge) as MIN,
  MAX(total_charge) as MAX
FROM payer_gold.claims_enriched""",
            "description": "Descriptive statistics"
        },
        {
            "sas_proc": "PROC SORT",
            "sas_code": "PROC SORT DATA=claims; BY member_id claim_date;",
            "databricks_sql": """
SELECT * FROM payer_gold.claims_enriched
ORDER BY member_id, claim_date""",
            "description": "Data sorting"
        },
        {
            "sas_proc": "PROC SQL",
            "sas_code": "PROC SQL; CREATE TABLE summary AS SELECT member_id, SUM(total_charge) as total FROM claims GROUP BY member_id;",
            "databricks_sql": """
CREATE OR REPLACE TABLE summary AS
SELECT member_id, SUM(total_charge) as total 
FROM payer_gold.claims_enriched 
GROUP BY member_id""",
            "description": "SQL-based data manipulation"
        }
    ]
    
    for i, trans in enumerate(translations, 1):
        print(f"\n{i}. {trans['sas_proc']} - {trans['description']}")
        print("-" * 40)
        print("SAS Code:")
        print(trans['sas_code'])
        print("\nDatabricks Equivalent:")
        print(trans['databricks_sql'])
        print()

show_sas_translation()


Unnamed: 0,total_charge
0,120.0
1,180.5
2,200.0
3,176.0
4,300.0
5,240.75
6,189.0
7,150.0


### Advanced Statistical Analysis
**Healthcare-specific statistical measures**


In [None]:
%sql
-- Healthcare Industry KPIs and Statistical Analysis
WITH healthcare_metrics AS (
  SELECT 
    -- Grouping dimensions
    plan_id,
    gender,
    specialty,
    
    -- Cost metrics
    COUNT(DISTINCT claim_id) as claim_volume,
    SUM(total_charge) as total_medical_costs,
    AVG(total_charge) as avg_cost_per_claim,
    PERCENTILE_APPROX(total_charge, 0.90) as cost_90th_percentile,
    
    -- Utilization metrics  
    COUNT(DISTINCT member_id) as unique_members,
    COUNT(DISTINCT provider_id) as unique_providers,
    
    -- Quality metrics
    COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) as denied_claims_count,
    COUNT(CASE WHEN claim_status = 'paid' THEN 1 END) as paid_claims_count,
    
    -- Time-based metrics
    AVG(DATEDIFF(CURRENT_DATE(), claim_date)) as avg_days_since_service
    
  FROM payer_gold.claims_enriched
  GROUP BY plan_id, gender, specialty
),
calculated_metrics AS (
  SELECT 
    *,
    -- Healthcare KPIs
    ROUND(total_medical_costs / unique_members, 2) as cost_per_member,
    ROUND(claim_volume * 1.0 / unique_members, 2) as claims_per_member,
    ROUND(denied_claims_count * 100.0 / NULLIF(denied_claims_count + paid_claims_count, 0), 2) as denial_rate_pct,
    
    -- Statistical measures
    ROUND(cost_90th_percentile / avg_cost_per_claim, 2) as cost_variability_ratio,
    
    -- Risk indicators
    CASE 
      WHEN denied_claims_count * 100.0 / NULLIF(denied_claims_count + paid_claims_count, 0) > 20 THEN 'High Denial Risk'
      WHEN total_medical_costs / unique_members > 500 THEN 'High Cost Risk'
      ELSE 'Normal Risk'
    END as risk_flag
    
  FROM healthcare_metrics
  WHERE unique_members > 0  -- Avoid division by zero
)
SELECT * FROM calculated_metrics 
ORDER BY cost_per_member DESC


Unnamed: 0,claim_date,total_charge
0,2023-01-10,120.0
1,2023-02-01,180.5
2,2023-01-11,200.0
3,2023-03-10,176.0
4,2023-01-12,300.0
5,2023-02-13,240.75
6,2023-03-15,189.0
7,2023-03-05,150.0


## Step 9: Clinical Data Integration & Analysis
**Working with diagnosis and procedure data**


### Create Enhanced Clinical Gold Tables


In [None]:
%sql
-- First, create clean diagnosis and procedure tables in Silver layer
CREATE OR REPLACE TABLE payer_silver.diagnosis AS
SELECT DISTINCT
  claim_id,
  diagnosis_code,
  TRIM(diagnosis_desc) as diagnosis_desc
FROM payer_bronze.diagnosis_raw
WHERE claim_id IS NOT NULL AND diagnosis_code IS NOT NULL;

CREATE OR REPLACE TABLE payer_silver.procedures AS  
SELECT DISTINCT
  claim_id,
  procedure_code,
  TRIM(procedure_desc) as procedure_desc,
  CAST(REGEXP_REPLACE(CAST(amount AS STRING), '[^0-9.]', '') AS DECIMAL(10,2)) as procedure_amount
FROM payer_bronze.procedures_raw
WHERE claim_id IS NOT NULL AND procedure_code IS NOT NULL;


In [None]:
%sql
-- Create comprehensive clinical analytics table
CREATE OR REPLACE TABLE payer_gold.clinical_analytics AS
SELECT 
  c.claim_id,
  c.member_id,
  c.provider_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  c.gender,
  c.plan_id,
  c.specialty,
  
  -- Diagnosis information
  d.diagnosis_code,
  d.diagnosis_desc,
  
  -- Procedure information  
  p.procedure_code,
  p.procedure_desc,
  p.procedure_amount,
  
  -- Clinical classifications
  CASE 
    WHEN d.diagnosis_code LIKE 'E%' THEN 'Endocrine/Metabolic'
    WHEN d.diagnosis_code LIKE 'I%' THEN 'Circulatory System'
    WHEN d.diagnosis_code LIKE 'J%' THEN 'Respiratory System' 
    WHEN d.diagnosis_code LIKE 'M%' THEN 'Musculoskeletal'
    WHEN d.diagnosis_code LIKE 'F%' THEN 'Mental/Behavioral'
    WHEN d.diagnosis_code LIKE 'N%' THEN 'Genitourinary'
    ELSE 'Other'
  END as diagnosis_category,
  
  -- Procedure categories
  CASE 
    WHEN p.procedure_code LIKE '99%' THEN 'Office/Outpatient Visits'
    WHEN p.procedure_code LIKE '80%' OR p.procedure_code LIKE '87%' THEN 'Lab/Diagnostic'
    WHEN p.procedure_code LIKE '71%' OR p.procedure_code LIKE '93%' THEN 'Radiology/Cardiology'
    WHEN p.procedure_code LIKE '36%' THEN 'Blood Services'
    ELSE 'Other Procedures'
  END as procedure_category

FROM payer_gold.claims_enriched c
LEFT JOIN payer_silver.diagnosis d ON c.claim_id = d.claim_id  
LEFT JOIN payer_silver.procedures p ON c.claim_id = p.claim_id
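
One property of this table matters for any later aggregation: joining both diagnoses and procedures on `claim_id` pairs every diagnosis with every procedure, so a claim-level column such as `total_charge` is repeated on each resulting row. The sketch below shows the effect on a single made-up claim. All values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical claim with two diagnoses and three procedures
claim = {"claim_id": "C1", "total_charge": 300.0}
diagnoses = ["E11.9", "I10"]
procedures = ["99213", "80053", "36415"]

# Joining both child tables on claim_id pairs every diagnosis with every procedure
joined = [
    {**claim, "diagnosis_code": d, "procedure_code": p}
    for d, p in product(diagnoses, procedures)
]

# Summing the repeated claim-level charge overcounts it
naive_total = sum(row["total_charge"] for row in joined)

# Deduplicate to one row per claim before summing the claim-level charge
per_claim = {row["claim_id"]: row["total_charge"] for row in joined}
deduped_total = sum(per_claim.values())

print(len(joined), naive_total, deduped_total)
```

Reports built on `payer_gold.clinical_analytics` should therefore deduplicate to the claim grain, or use `COUNT(DISTINCT ...)`, before summing or averaging claim-level amounts.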


### Clinical Intelligence Reports


In [None]:
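%sql
-- Top diagnosis categories by cost and volume
-- clinical_analytics holds one row per claim x diagnosis x procedure, so reduce it to
-- one row per claim and diagnosis category before aggregating claim-level charges.
-- A claim with diagnoses in several categories still counts toward each of them.
WITH claim_categories AS (
  SELECT DISTINCT claim_id, member_id, total_charge, diagnosis_category
  FROM payer_gold.clinical_analytics
  WHERE diagnosis_category IS NOT NULL
)
SELECT 
  diagnosis_category,
  COUNT(DISTINCT claim_id) as claim_count,
  COUNT(DISTINCT member_id) as member_count,
  SUM(total_charge) as total_cost,
  ROUND(AVG(total_charge), 2) as avg_cost_per_claim,
  ROUND(SUM(total_charge) / COUNT(DISTINCT member_id), 2) as cost_per_member
FROM claim_categories
GROUP BY diagnosis_category
ORDER BY total_cost DESC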


## Step 10: AI/BI Integration & Next Steps
**Modern analytics capabilities for healthcare payers**


### Databricks AI/BI Features for Healthcare Analytics

**Intelligent analytics for everyone!**

Databricks AI/BI enables self-service data analysis with:

🤖 **Natural Language Queries**: Ask questions in plain English
- "What's the average cost per member by plan?"
- "Which diagnosis codes have the highest denial rates?"
- "Show me utilization trends by specialty"

📊 **Auto-Generated Dashboards**: Create visualizations automatically from your data

🔍 **Semantic Layer**: AI understands your healthcare data context

![AI/BI](https://www.databricks.com/sites/default/files/2025-05/hero-image-ai-bi-v2-2x.png)

### Genie - Conversational Analytics

Now everyone can get insights from data simply by asking questions in natural language.

![Genie](https://www.databricks.com/sites/default/files/2025-06/ai-bi-genie-hero.png)

### Try These AI Assistant Prompts:
* "What kind of aggregations can I do with table 'payer_gold.claims_enriched'?"
* "Show me the top 5 most expensive claims by specialty"
* "Which provider specialty has the highest average claim amount?"
* "Create a chart showing denial rates by diagnosis category"
* "What are the key performance indicators for this payer dataset?"


### Healthcare Analytics Best Practices for SAS/ASA Professionals

**📋 Transition Guide from SAS to Databricks:**

1. **Data Processing Philosophy**
   - SAS: Procedural, step-by-step processing
   - Databricks: SQL-first with Spark parallel processing

2. **Key Advantages for Healthcare Payers:**
   - ⚡ **Scale**: Process millions of claims in parallel across a Spark cluster
   - 💰 **Cost**: Pay only for the compute you use
   - 🔄 **Real-time**: Stream processing for live claims
   - 🤝 **Collaboration**: Share notebooks across teams
   - 🛡️ **Security**: Governance and access controls that support HIPAA compliance

3. **Migration Strategy:**
   - Start with familiar SQL syntax
   - Gradually adopt Spark functions for complex transformations
   - Use Databricks AutoML for predictive models
   - Leverage AI/BI for self-service analytics

4. **Next Steps:**
   - Implement real-time claims processing
   - Add machine learning for fraud detection
   - Create automated regulatory reporting
   - Build patient journey analytics
   - Integrate with claims systems via APIs

**💡 Remember**: This notebook demonstrates core patterns. Scale these approaches to your production healthcare data volumes!
