# 🎓 Databricks for Actuaries: Healthcare Analytics Workshop
**A Hands-On Guide for SAS Users Transitioning to Databricks**

---

## 👋 Welcome Actuaries!

This workshop is designed specifically for **actuaries** who are familiar with **SAS** and want to learn Databricks. Don't worry if you're new to Python or SQL - we'll guide you step by step!

## 📚 Workshop Objectives

By the end of this workshop, you will be able to:

1. ✅ Understand how Databricks compares to your SAS workflows
2. ✅ Load and query healthcare payer data using **simple SQL**
3. ✅ Perform actuarial analyses you're familiar with (loss ratios, trends, reserving)
4. ✅ Create interactive visualizations without complex code
5. ✅ Build analytics tables for pricing, reserving, and risk management

---

## 🎯 What Makes This Workshop Different?

- **Beginner-Friendly**: No prior PySpark/Python experience needed
- **Actuarial-Focused**: Examples from pricing, reserving, and risk management
- **SAS Comparisons**: See how familiar SAS code translates to Databricks
- **Interactive Exercises**: Learn by doing with real actuarial problems
- **Hands-On**: Less theory, more practice!

---

### 🏥 Dataset Overview

We'll work with **healthcare payer data** including:
- **Members**: Health plan enrollees (like your policy data)
- **Claims**: Medical claim submissions (incurred losses)
- **Providers**: Healthcare providers (similar to provider networks)
- **Diagnoses**: Diagnosis codes from claims
- **Procedures**: Medical procedures performed

**Think of it as**: Claims = Losses, Members = Policies, Providers = Service Providers

---



## 📑 Table of Contents

### Part 1: Getting Started (Quick Setup)
1. **[Introduction for Actuaries](#introduction)**
   - Databricks vs SAS: What's Different?
   - Quick Tour of the Platform
   - Your First Query

2. **[Setup (We'll Do This Together)](#setup)**
   - Loading Your Data (Bronze & Silver - Simplified)
   - Checking Your Tables

### Part 2: Actuarial Analytics (The Fun Part! 🎉)

3. **[Gold Layer - Actuarial Analytics](#gold-layer)**
   - **Loss Ratio Analysis** ⭐
   - **Claims Development & Trending** ⭐
   - **Frequency & Severity Metrics** ⭐
   - **Risk Segmentation & Scoring** ⭐
   - **IBNR Indicators** ⭐

4. **[Interactive Exercises](#exercises)**
   - 🎯 Exercise 1: Calculate Loss Ratios by Segment
   - 🎯 Exercise 2: Trend Analysis (Like PROC EXPAND!)
   - 🎯 Exercise 3: Age-to-Age Development Factors
   - 🎯 Exercise 4: Premium Adequacy Analysis
   - 🎯 Exercise 5: Risk Score Modeling
   - 🎯 Exercise 6: Claims Development Triangle

5. **[Visualizations for Actuaries](#visualizations)**
   - Loss Ratio Charts
   - Development Triangles
   - Trend Lines
   - Distribution Analysis

6. **[SAS to Databricks Quick Reference](#sas-reference)**
   - Common SAS Procedures → SQL/PySpark
   - PROC SQL → Databricks SQL
   - PROC MEANS → Aggregations
   - PROC FREQ → GROUP BY
   - Data Steps → Transformations

7. **[Next Steps](#next-steps)**
   - Taking This Back to Your Work
   - Resources for Actuaries
   - Getting Help

---

> **💡 Workshop Flow**: We'll move quickly through setup, then spend most of our time on actuarial analytics. Feel free to ask questions anytime!

---

> **💡 Workshop Tip**: This is a hands-on workshop! Execute each cell as you go and experiment with the code. Don't hesitate to ask questions or use the Databricks AI Assistant.

---


# Introduction for Actuaries

## 🤔 Why Databricks for Actuaries?

If you're coming from **SAS**, you might be wondering: "Why learn another tool?"

### Here's Why:
- **Scalability**: Process millions of claims on distributed compute (no more waiting on a single-server PROC SQL job!)
- **Modern Analytics**: Built-in ML, real-time dashboards, and collaboration
- **Cost-Effective**: Cloud-based, pay only for what you use
- **Still Use SQL**: Most of your SAS PROC SQL knowledge transfers directly!

---

## 🔄 SAS vs Databricks: Quick Comparison

| **What You Do in SAS** | **How You Do It in Databricks** | **Difficulty** |
|------------------------|----------------------------------|----------------|
| `PROC SQL` | SQL queries (almost identical!) | ⭐ Easy |
| `PROC MEANS` | `GROUP BY` + aggregate functions | ⭐ Easy |
| `PROC FREQ` | `GROUP BY` + `COUNT()` | ⭐ Easy |
| `DATA` step | SQL `SELECT` or simple Python | ⭐⭐ Moderate |
| `PROC EXPAND` (trending) | Window functions | ⭐⭐ Moderate |
| Macros | Parameters + reusable queries | ⭐⭐⭐ Learning curve |

**Good News**: Most of what you do can be done with **SQL alone**!
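
To make the first rows of the table concrete, here is a minimal sketch of the `PROC FREQ` and `PROC MEANS` equivalents written as plain SQL. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL engine so it runs anywhere; the `claims` rows are invented for illustration and are not the workshop data.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL warehouse (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id TEXT, claim_status TEXT, total_charge REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("C1", "paid", 100.0),
        ("C2", "paid", 300.0),
        ("C3", "denied", 50.0),
        ("C4", "paid", 200.0),
    ],
)

# PROC FREQ-style one-way frequency table: GROUP BY + COUNT(*)
freq = conn.execute(
    "SELECT claim_status, COUNT(*) AS n "
    "FROM claims GROUP BY claim_status ORDER BY claim_status"
).fetchall()

# PROC MEANS-style summary statistics: aggregate functions over the whole table
means = conn.execute(
    "SELECT COUNT(*), MIN(total_charge), MAX(total_charge), AVG(total_charge) FROM claims"
).fetchone()

print(freq)   # [('denied', 1), ('paid', 3)]
print(means)  # (4, 50.0, 300.0, 162.5)
```

The same `SELECT` statements should look familiar from PROC SQL; in a Databricks notebook you would run them in a `%sql` cell against your own tables instead.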

---

## 🏗️ Quick Concept: Medallion Architecture (Simplified)

Think of it like your SAS workflow:

```
📥 BRONZE (Raw Data)        →  Like your input datasets from source systems
   ↓
🔧 SILVER (Clean Data)      →  Like your cleaned/standardized datasets  
   ↓
⭐ GOLD (Analytics Tables)   →  Like your final reporting/analysis datasets
```

**Today's Focus**: We'll quickly load Bronze/Silver, then spend most time on **Gold** (actuarial analytics)!

---


## 🏠 What is a Lakehouse? (Simple Explanation)

**For Actuaries**: Think of it as a **super-powered SAS library** that:
- Stores all your data in one place (claims, policies, members)
- Lets you analyze it with SQL (like PROC SQL)
- Handles millions of rows instantly
- Keeps track of all changes (audit trail)
- Lets multiple people work at once (no locking issues!)

**Key Benefit**: Unlike SAS datasets, you can query **billions** of claims in seconds!

<img src="https://www.databricks.com/wp-content/uploads/2020/01/data-lakehouse-new.png" alt="Lakehouse" width="500" height="350">

---

## 📚 Unity Catalog (Data Organization - Like SAS Libraries)

**For Actuaries**: Think of Unity Catalog as your **SAS library structure**, but better organized:

```
In SAS:                    In Databricks:
LIBNAME.DATASET            CATALOG.SCHEMA.TABLE
  ↓                           ↓
work.claims        →       my_catalog.payer_bronze.claims
actuarial.loss_ratios →    my_catalog.payer_gold.loss_ratios
```
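
The three-level naming pattern can be sketched in a few lines of Python. The helper name `qualified_name` is hypothetical (it is not a Databricks function); the catalog, schema, and table names are the ones used in this workshop.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.table."""
    parts = [catalog, schema, table]
    for part in parts:
        # Keep the sketch simple: only allow plain identifiers
        # (letters, digits, underscores), and reject empty strings
        if not part.replace("_", "").isalnum():
            raise ValueError(f"Unexpected identifier: {part!r}")
    return ".".join(parts)


bronze_claims = qualified_name("my_catalog", "payer_bronze", "claims_raw")
gold_loss_ratios = qualified_name("my_catalog", "payer_gold", "loss_ratios_by_segment")

print(bronze_claims)     # my_catalog.payer_bronze.claims_raw
print(gold_loss_ratios)  # my_catalog.payer_gold.loss_ratios_by_segment
```

Compared with a two-level SAS name such as `actuarial.loss_ratios`, the extra leading level (the catalog) is what lets one workspace hold several independent sets of schemas.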

**Benefits:**
- ✅ Everyone sees the same data (no duplicate datasets!)
- ✅ Built-in security (control who can see PHI/PII)
- ✅ Audit trail (track all data access)
- ✅ Easy to find data (searchable catalog)

<img src="https://www.databricks.com/sites/default/files/2025-05/header-unity-catalog.png?v=1748513086" alt="Unity Catalog" width="500" height="300">

---






## 🥉🥈🥇 Medallion Architecture (Your Data Quality Layers)

**For Actuaries**: This is like your SAS data prep workflow, but organized into layers:

### 📥 Bronze (Raw Data) 
- **Like**: Your raw claims extracts from source systems
- **Contains**: Data exactly as received (CSV, database extracts)
- **Example**: Raw claims file from claims system
- **Today**: We'll load this quickly!

### 🔧 Silver (Cleaned Data)
- **Like**: Your cleaned/standardized SAS datasets
- **Contains**: Deduplicated, standardized data
- **Example**: Claims with proper data types, duplicates removed
- **Today**: We'll auto-clean this!

### ⭐ Gold (Analytics Tables)
- **Like**: Your final analysis datasets (loss triangles, premium summaries)
- **Contains**: Business-ready tables for actuarial analysis
- **Example**: Loss ratios, development factors, IBNR estimates
- **Today**: This is where we'll spend most time! 🎉

<img src="https://www.databricks.com/sites/default/files/inline-images/building-data-pipelines-with-delta-lake-120823.png?v=1702318922" alt="Medallion Architecture" width="500" height="350">

---

> **💡 For This Workshop**: We'll speed through Bronze/Silver (15 min) so we can focus on Gold analytics (1.5 hours)!

## Managed tables

[How Unity Catalog Managed Tables Automate Performance at Scale](https://www.databricks.com/blog/how-unity-catalog-managed-tables-automate-performance-scale) with [Predictive Optimization](https://learn.microsoft.com/en-us/azure/databricks/optimizations/predictive-optimization)



<img src="https://www.databricks.com/sites/default/files/inline-images/image2_48.png?v=1751297384" alt="Managed Tables" width="600" height="500">


[Faster Queries: 20X query latency reduction](https://www.databricks.com/blog/predictive-optimization-automatically-delivers-faster-queries-and-lower-tco)

**Predictive Optimization** automates table maintenance for Unity Catalog managed tables. It currently runs the following operations:

* **OPTIMIZE** - Triggers incremental clustering for enabled tables. Improves query performance by optimizing file sizes.
* **VACUUM** - Reduces storage costs by deleting data files no longer referenced by the table.
* **ANALYZE** - Triggers incremental update of statistics to improve query performance. 


<img src="https://www.databricks.com/sites/default/files/styles/max_1000x1000/public/2024-05/db-976-blog-img-og.png?itok=qWBT8VA-&v=1717158571" alt="Managed Tables" width="600" height="500">

**Compaction** - This enhances query performance by optimizing file sizes, ensuring that data retrieval is efficient.

**Liquid Clustering** - This technique incrementally clusters incoming data, enabling optimal data layout and efficient data skipping.



# Databricks Medallion Pipeline for a Healthcare Payer


## Modeling Concepts

Databricks fully supports both **dimensional modeling** (Kimball/star schema) and **Inmon-style, Data Vault architectures (hubs, satellites, links)** on the Lakehouse platform. For dimensional models, you can build classic star and snowflake schemas directly with SQL, benefiting from ACID transactions and scalable Delta Lake tables.

For Inmon/Data Vault use cases, Databricks supports hub-and-satellite models that address core enterprise needs for history, auditability, and extensibility.

The Lakehouse approach lets you mix these styles as needed within a single platform, so you can incrementally land data in Raw Vault/EDW structures and later expose it as dimensional marts—all with Delta Live Tables, fine-grained security, and open formats.

Key blog resources:

[Implementing Dimensional Modeling](https://www.databricks.com/blog/implementing-dimensional-data-warehouse-databricks-sql-part-1)

[Implementing Data Vault/Hub-Satellite](https://www.databricks.com/blog/2022/06/24/prescriptive-guidance-for-implementing-a-data-vault-model-on-the-databricks-lakehouse-platform.html) 

[Data Vault Best Practices](https://www.databricks.com/blog/data-vault-best-practice-implementation-lakehouse)

<div style="display: flex; justify-content: space-between;">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/5c87faea-3e60-4f71-826d-42d04f6cdc0b.png" alt="Managed Tables" width="400" height="350">
  <img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/6826c275-d462-4c07-a978-43fe9c40f3ed.png" alt="Managed Tables" width="400" height="350">
</div>






## Sample Data Model

For a payer, commonly used tables include:

- **Members**: members enrolled in a health plan
- **Claims**: medical claim submissions
- **Providers**: healthcare providers (doctors, clinics)
- **Diagnoses**: claim diagnosis codes
- **Procedures**: procedures/services performed

Each table should have at least 50 rows.

<img src="https://user-gen-media-assets.s3.amazonaws.com/gpt4o_images/bdd54dc0-f3c7-4975-80a3-0017ebdb121c.png" alt="Sample Data Model" width="400" height="300">





## Key Columns by Table

| **Table** | **Key Columns** |
|-----------|-----------------|
| **Members** | member_id, first_name, last_name, birth_date, gender, plan_id, effective_date |
| **Claims** | claim_id, member_id, provider_id, claim_date, total_charge, claim_status |
| **Providers** | provider_id, npi, provider_name, specialty, address, city, state |
| **Diagnoses** | claim_id, diagnosis_code, diagnosis_desc |
| **Procedures** | claim_id, procedure_code, procedure_desc, amount |

# SETUP

In [0]:
dbutils.widgets.text("catalog", "my_catalog", "Catalog")
dbutils.widgets.text("bronze_db", "payer_bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "payer_silver", "Silver DB")
dbutils.widgets.text("gold_db", "payer_gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

path = f"/Volumes/{catalog}/{bronze_db}/payer/files/"

In [0]:
print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")
print(f"Path: {path}")

In [0]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

In [0]:
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

Create new **Volumes** as below and upload shared files to your volumes.

(schema) payer_bronze \
|--- payer/files/ \
|------ claims \
|------ members \
|------ providers \
|------ diagnosis \
|------ procedures


In [0]:
spark.sql(f"CREATE VOLUME IF NOT EXISTS {bronze_db}.payer")

In [0]:
# Create the folders inside the volume
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/claims")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/members")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/files/providers")
dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/payer/downloads")

In [0]:
import requests
import zipfile
import io
import os
import shutil

# Define the URL of the ZIP file
url = "https://github.com/bigdatavik/databricksfirststeps/blob/6b225621c3c010a2734ab604efd79c15ec6c71b8/data/Payor_Archive.zip?raw=true"

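# Download the ZIP file (fail fast on HTTP errors instead of unzipping an error page)
response = requests.get(url, timeout=60)
response.raise_for_status()
zip_file = zipfile.ZipFile(io.BytesIO(response.content))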

# Define the base path
base_path = f"/Volumes/{catalog}/{bronze_db}/payer/downloads" 

# Extract the ZIP file to the base path
zip_file.extractall(base_path)

# Define the paths
paths = {
    "claims.csv": f"{base_path}/claims",
    "diagnoses.csv": f"{base_path}/diagnosis",
    "procedures.csv": f"{base_path}/procedures",
    "member.csv": f"{base_path}/members",
    "providers.csv": f"{base_path}/providers"
}

# Create the destination directories if they do not exist
for dest_path in paths.values():
    os.makedirs(dest_path, exist_ok=True)

# Move the files to the respective directories
for file_name, dest_path in paths.items():
    source_file = f"{base_path}/{file_name}"
    if os.path.exists(source_file):
        os.rename(source_file, f"{dest_path}/{file_name}")



In [0]:
%python
# Copy the files to the specified directories and print the paths
shutil.copy(f"{base_path}/claims/claims.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/claims/claims.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/claims/claims.csv")

shutil.copy(f"{base_path}/diagnosis/diagnoses.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/diagnosis/diagnosis.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/diagnosis/diagnosis.csv")

shutil.copy(f"{base_path}/procedures/procedures.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/procedures/procedures.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/procedures/procedures.csv")

shutil.copy(f"{base_path}/members/member.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/members/members.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/members/members.csv")

shutil.copy(f"{base_path}/providers/providers.csv", f"/Volumes/{catalog}/{bronze_db}/payer/files/providers/providers.csv")
print(f"Copied to /Volumes/{catalog}/{bronze_db}/payer/files/providers/providers.csv")

# 🚀 Let's Build Your First Data Pipeline!

---

## Workshop Roadmap

```
📥 Bronze Layer    →    🔧 Silver Layer    →    ⭐ Gold Layer    →    📊 Analytics
   (Raw Data)          (Cleaned Data)        (Business Tables)      (Insights)
```

In the following sections, we'll build a complete data pipeline following the **Medallion Architecture**:

1. **Bronze Layer**: Ingest raw CSV files into Delta tables
2. **Silver Layer**: Clean, deduplicate, and transform data
3. **Gold Layer**: Create enriched analytics tables
4. **Analytics**: Generate insights and visualizations

Let's get started! 🎉

In [0]:
# %sql
# -- Set the catalog and schema
# CREATE CATALOG IF NOT EXISTS my_catalog;
# USE CATALOG my_catalog;

# -- Create bronze schema
# CREATE SCHEMA IF NOT EXISTS payer_bronze;

# 📥 Bronze Layer – Ingest Raw Data

---

## What is the Bronze Layer?

The **Bronze Layer** is the landing zone for raw data. Here we:
- 📂 Load data "as-is" from source files (CSV, JSON, Parquet, etc.)
- 💾 Store in Delta Lake format for ACID transactions
- 📝 Apply minimal transformation (just schema inference)
- ⏱️ Keep historical data for audit and reprocessing

> **💡 Best Practice**: Use `COPY INTO` for incremental, idempotent loading. It automatically skips already-loaded files!

---



## Step 1: Verify Source Files

Let's first check that our source files are available:

In [0]:
%sql
LIST '/Volumes/my_catalog/payer_bronze/payer/files/claims/'

## Step 2: Load Data with COPY INTO

### 📖 Understanding COPY INTO

`COPY INTO` is Databricks' recommended command for loading data from cloud storage into Delta tables.

**Key Benefits:**
- ✅ **Idempotent**: Safely re-run without duplicating data
- ✅ **Incremental**: Only loads new files automatically
- ✅ **Schema Evolution**: Can merge new columns with `mergeSchema` option
- ✅ **Atomic**: Either succeeds completely or rolls back

**Syntax:**
```sql
COPY INTO <table_name>
FROM '<source_path>'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS('mergeSchema' = 'true')
```
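
To build intuition for what "idempotent" and "incremental" mean here, the following plain-Python sketch mimics the behavior described above: it remembers which files were already loaded and skips them on later runs. This is only an illustration of the idea, not how `COPY INTO` is implemented; the function name `copy_into_sketch` and the file names are hypothetical.

```python
def copy_into_sketch(target_rows, loaded_files, source_files, force=False):
    """Append rows from files not loaded before; return how many files were processed."""
    processed = 0
    for name, rows in source_files.items():
        if name in loaded_files and not force:
            continue  # already loaded: skipping it is what makes re-runs safe
        target_rows.extend(rows)
        loaded_files.add(name)
        processed += 1
    return processed


table, history = [], set()
files = {
    "claims_2024_01.csv": [("C1", 100.0)],
    "claims_2024_02.csv": [("C2", 250.0)],
}

first_run = copy_into_sketch(table, history, files)    # loads both files
second_run = copy_into_sketch(table, history, files)   # nothing new, loads nothing
files["claims_2024_03.csv"] = [("C3", 75.0)]
third_run = copy_into_sketch(table, history, files)    # loads only the new file
rows_before_force = len(table)

# force=True ignores the load history, so every row is appended a second time
copy_into_sketch(table, history, files, force=True)
rows_after_force = len(table)

print(first_run, second_run, third_run)     # 2 0 1
print(rows_before_force, rows_after_force)  # 3 6
```

The last two lines show why a forced reload is convenient for demos but risky for real pipelines: the target ends up with duplicate rows unless a later step deduplicates them.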

📚 **Learn More:**
- [COPY INTO Documentation](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-copy-into)
- [COPY INTO Examples](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/copy-into/)


### Loading Data with SQL

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.claims_raw;
COPY INTO payer_bronze.claims_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/claims/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'true');

-- NOTE: 'force' = 'true' is used here for demo purposes only. It reloads every file on each run, so re-running this cell appends the same rows again. In production, omit this option so COPY INTO only processes new data files.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.diagnosis_raw;
COPY INTO payer_bronze.diagnosis_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/diagnosis/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.members_raw;
COPY INTO payer_bronze.members_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/members/')

FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.procedures_raw;
COPY INTO payer_bronze.procedures_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/procedures/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
%sql
CREATE TABLE IF NOT EXISTS payer_bronze.providers_raw;
COPY INTO payer_bronze.providers_raw FROM
(SELECT
*
FROM '/Volumes/my_catalog/payer_bronze/payer/files/providers/')
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true',
               'inferSchema' = 'true',
               'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


### 🐍 Alternative: Loading Data with PySpark

While SQL is great for batch loading, PySpark gives you more programmatic control. Here's how to load the same data using PySpark:

In [0]:
# Example: Load data using PySpark
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Let Spark infer the schema (the simplest option for a small workshop dataset)
claims_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/Volumes/my_catalog/payer_bronze/payer/files/claims/")

# Display first 10 rows
display(claims_df.limit(10))

# Show schema
print("Claims Schema:")
claims_df.printSchema()

# Get row count
print(f"\nTotal rows loaded: {claims_df.count()}")

# Write to Delta table (this creates or replaces the table)
# claims_df.write \
#     .format("delta") \
#     .mode("overwrite") \
#     .saveAsTable("payer_bronze.claims_raw_pyspark")


### 🎯 Exercise: Query Bronze Tables

Now that we've loaded data into Bronze tables, let's explore what we have:

**Try these queries yourself:**


In [0]:
# Query: Count records in each bronze table using PySpark
from pyspark.sql.functions import lit, count

tables = ['claims_raw', 'members_raw', 'providers_raw', 'diagnosis_raw', 'procedures_raw']
row_counts = []

for table in tables:
    cnt = spark.table(f"payer_bronze.{table}").count()
    row_counts.append((table, cnt))
    
# Create DataFrame to display results
result_df = spark.createDataFrame(row_counts, ["table_name", "row_count"])
display(result_df)


# 🔧 Silver Layer – Transform, Clean, and Join

---

## What is the Silver Layer?

The **Silver Layer** is where we transform raw data into clean, validated, and enriched datasets. Here we:

- 🧹 **Clean**: Remove nulls, trim whitespace, fix data quality issues
- 🔄 **Transform**: Cast data types, standardize formats
- 🗑️ **Deduplicate**: Remove duplicate records based on business keys
- 🔍 **Validate**: Apply business rules and data quality checks
- 📊 **Enrich**: Join related tables, calculate derived columns

> **💡 Best Practice**: Silver tables should be "analytics-ready" – cleaned, validated, and properly typed!
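
As a rough mental model of the clean, cast, and deduplicate steps listed above, here is a small plain-Python sketch that applies them to a handful of invented member rows. It runs without Spark and is not the workshop's Silver code; the function name `to_silver` and the sample values are assumptions made for illustration.

```python
from datetime import date

# Invented raw rows, roughly what a Bronze table might hold
raw_members = [
    {"member_id": "M1", "first_name": " Ana ", "birth_date": "1980-05-01"},
    {"member_id": "M1", "first_name": "Ana", "birth_date": "1980-05-01"},  # duplicate once trimmed
    {"member_id": None, "first_name": "Bo", "birth_date": "1975-01-15"},   # missing business key
    {"member_id": "M2", "first_name": "Cy", "birth_date": "1990-12-31"},
]


def to_silver(rows):
    """Drop rows without a key, trim and cast fields, then remove exact duplicates."""
    seen, clean = set(), []
    for row in rows:
        if row["member_id"] is None:          # validate: key must be present
            continue
        record = (
            str(row["member_id"]),                   # cast key to string
            row["first_name"].strip(),               # clean: trim whitespace
            date.fromisoformat(row["birth_date"]),   # transform: text -> date
        )
        if record not in seen:                # deduplicate on the cleaned record
            seen.add(record)
            clean.append(record)
    return clean


silver_members = to_silver(raw_members)
for member in silver_members:
    print(member)
```

Note that the duplicate is only detected after trimming, which is why cleaning comes before deduplication in this sketch.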



## Step 1: Transform Bronze to Silver (SQL)

Let's clean and transform our Bronze tables. We'll demonstrate with multiple examples using both **SQL** and **PySpark**.

In [0]:
%sql
-- Create silver schema
CREATE SCHEMA IF NOT EXISTS payer_silver;


-- Members: select relevant fields, cast types, remove duplicates
CREATE OR REPLACE TABLE payer_silver.members AS
SELECT
  DISTINCT CAST(member_id AS STRING) AS member_id,
  TRIM(first_name) AS first_name,
  TRIM(last_name) AS last_name,
  CAST(birth_date AS DATE) AS birth_date,
  gender,
  plan_id,
  CAST(effective_date AS DATE) AS effective_date
FROM payer_bronze.members_raw
WHERE member_id IS NOT NULL;


-- Claims: remove duplicates, prepare data
CREATE OR REPLACE TABLE payer_silver.claims AS
SELECT
  DISTINCT claim_id,
  member_id,
  provider_id,
  CAST(claim_date AS DATE) AS claim_date,
  ROUND(total_charge, 2) AS total_charge,
  LOWER(claim_status) AS claim_status
FROM payer_bronze.claims_raw
WHERE claim_id IS NOT NULL AND total_charge > 0;


-- Providers: deduplicate
CREATE OR REPLACE TABLE payer_silver.providers AS
SELECT
  DISTINCT provider_id,
  npi,
  provider_name,
  specialty,
  address,
  city,
  state
FROM payer_bronze.providers_raw
WHERE provider_id IS NOT NULL;


## Step 2: Transform with PySpark

Now let's see how to do the same transformations using PySpark. This approach is more flexible for complex business logic.

### Example: Transform Procedures Table with PySpark


In [0]:
from pyspark.sql.functions import col, trim, upper, round as spark_round, when, regexp_replace

# Read from Bronze
procedures_bronze = spark.table("payer_bronze.procedures_raw")

# Clean and cast the amount column
procedures_bronze_clean = procedures_bronze.withColumn(
    "amount_clean",
    regexp_replace(col("amount"), "[^0-9.]", "").cast("double")
)

# Apply transformations
procedures_silver = procedures_bronze_clean \
    .dropDuplicates(['claim_id', 'procedure_code']) \
    .filter(col("claim_id").isNotNull()) \
    .filter(col("amount_clean") > 0) \
    .select(
        col("claim_id"),
        upper(trim(col("procedure_code"))).alias("procedure_code"),
        trim(col("procedure_desc")).alias("procedure_desc"),
        spark_round(col("amount_clean"), 2).alias("amount"),
        when(col("amount_clean") < 100, "Low")
        .when(col("amount_clean") < 500, "Medium")
        .when(col("amount_clean") < 1000, "High")
        .otherwise("Very High").alias("cost_category")
    )

# Show sample data
print("Transformed Procedures (first 10 rows):")
display(procedures_silver.limit(10))

# Show statistics
print("\nCost Category Distribution:")
display(procedures_silver.groupBy("cost_category").count().orderBy("cost_category"))

# Write to Silver table
procedures_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_silver.procedures")

### 🎯 Exercise: Data Quality Checks

Let's verify our Silver transformations worked correctly:


In [0]:
from pyspark.sql.functions import col, sum as spark_sum

# Data Quality Check 1: Check for nulls in key columns
print("=== NULL CHECK ===")
members_df = spark.table("payer_silver.members")
null_counts = members_df.select([
    spark_sum(col(c).isNull().cast("int")).alias(c)
    for c in members_df.columns
])
display(null_counts)

# Data Quality Check 2: Check for duplicates
print("\n=== DUPLICATE CHECK ===")
claims_df = spark.table("payer_silver.claims")
total_rows = claims_df.count()
distinct_rows = claims_df.select("claim_id").distinct().count()
print(f"Total rows: {total_rows}")
print(f"Distinct claim_ids: {distinct_rows}")
print(f"Duplicates: {total_rows - distinct_rows}")

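# Data Quality Check 3: Value range checks
# Note: a dict passed to agg() can hold only one entry per column, so use explicit functions
from pyspark.sql.functions import min as spark_min, max as spark_max, avg as spark_avg

print("\n=== VALUE RANGE CHECK ===")
claims_stats = claims_df.agg(
    spark_min("total_charge").alias("min_charge"),
    spark_max("total_charge").alias("max_charge"),
    spark_avg("total_charge").alias("avg_charge"),
)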
display(claims_stats)

# ⭐ Gold Layer – Actuarial Analytics (The Fun Part!)

---

## 🎯 What is the Gold Layer? (For Actuaries)

**This is where YOU spend most of your time!** The Gold Layer is like your **final SAS analysis datasets** - ready for actuarial work.

### What We'll Build (Actuarial Examples):

1. **📊 Loss Ratios** - By state, specialty, plan (like your quarterly reports)
2. **📈 Claims Trends** - Frequency & severity trending (PROC EXPAND style)
3. **🔺 Development Triangles** - Age-to-age factors (for reserving)
4. **💰 Premium Adequacy** - Earned vs Incurred ratios
5. **🎲 Risk Scoring** - Member risk segmentation
6. **📉 IBNR Indicators** - Late claims reporting patterns

---

## 🔄 How This Compares to SAS

| **Your SAS Workflow** | **In Databricks Gold Layer** |
|----------------------|------------------------------|
| Create final analysis dataset | Create Gold table |
| PROC SQL with aggregations | SQL SELECT with GROUP BY |
| PROC MEANS for summary stats | Aggregate functions (AVG, SUM, etc.) |
| Multiple DATA steps for calcs | Single SQL statement with CTEs |
| Macros for repeated calcs | Parameterized queries |
| Export to Excel for viz | Built-in interactive charts! |

---

## 🎓 Your Actuarial Toolbox

Today you'll learn SQL equivalents for common actuarial analyses:

- **Loss Ratios**: `SUM(claims)/SUM(premium)`
- **Trending**: Window functions (`LAG`, `LEAD`)
- **Development Factors**: `GROUP BY` claim year + development period
- **Percentiles**: `PERCENTILE_CONT` function
- **Risk Scores**: `CASE WHEN` logic
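
If you are curious what a continuous percentile actually computes, the sketch below reproduces the usual SQL definition (linear interpolation between the two nearest ranked values) in plain Python. The function is a hand-written illustration with invented claim costs, not a call into Databricks.

```python
def percentile_cont(values, p):
    """Continuous percentile: linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)          # fractional, zero-based rank
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower                  # how far between the two ranks
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


claim_costs = [100.0, 200.0, 300.0, 400.0, 1000.0]  # invented sample

median = percentile_cont(claim_costs, 0.5)
p90 = percentile_cont(claim_costs, 0.9)

print(median)         # 300.0
print(round(p90, 2))  # 760.0
```

Because of the interpolation, the 90th percentile (760) is not one of the observed claim costs; it sits 60% of the way between 400 and 1000.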

**Ready?** Let's start building! 🚀

---


## 📊 Actuarial Example 1: Loss Ratios by Segment

### 🎯 Business Question
**"What is our loss ratio by provider specialty and state?"**

This is a **fundamental actuarial metric** - you probably calculate this quarterly or annually!

### 📝 SAS vs Databricks

**In SAS, you might write:**
```sas
PROC SQL;
    CREATE TABLE loss_ratios AS
    SELECT 
        specialty,
        state,
        COUNT(*) as claim_count,
        SUM(total_charge) as incurred_losses,
        CALCULATED incurred_losses / (SELECT SUM(premium) FROM policies) as loss_ratio
    FROM claims c
    LEFT JOIN providers p ON c.provider_id = p.provider_id
    GROUP BY specialty, state;
QUIT;
```

**In Databricks, we write:**
```sql
-- Very similar! Most SQL transfers directly.
```
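
To preview the shape of the calculation on a tiny scale, here is a minimal sketch that runs a segment query with frequency and severity using Python's built-in `sqlite3` as a stand-in SQL engine. The table contents are invented, and SQLite is only a convenience for running the example outside Databricks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE providers (provider_id TEXT, specialty TEXT, state TEXT);
    CREATE TABLE claims (claim_id TEXT, member_id TEXT, provider_id TEXT, total_charge REAL);
    INSERT INTO providers VALUES ('P1', 'Cardiology', 'NY'), ('P2', 'Pediatrics', 'NY');
    INSERT INTO claims VALUES
        ('C1', 'M1', 'P1', 1000.0),
        ('C2', 'M1', 'P1', 500.0),
        ('C3', 'M2', 'P1', 300.0),
        ('C4', 'M3', 'P2', 120.0);
""")

segments = conn.execute("""
    SELECT
        p.specialty,
        p.state,
        COUNT(DISTINCT c.member_id)                           AS member_count,
        SUM(c.total_charge)                                   AS total_incurred,
        COUNT(c.claim_id) * 1.0 / COUNT(DISTINCT c.member_id) AS frequency,
        SUM(c.total_charge) / COUNT(c.claim_id)               AS severity
    FROM claims c
    INNER JOIN providers p ON c.provider_id = p.provider_id
    GROUP BY p.specialty, p.state
    ORDER BY total_incurred DESC
""").fetchall()

for row in segments:
    print(row)
# ('Cardiology', 'NY', 2, 1800.0, 1.5, 600.0)
# ('Pediatrics', 'NY', 1, 120.0, 1.0, 120.0)
```

For Cardiology, frequency times severity (1.5 × 600) gives 900, which is the incurred cost per member (1800 / 2). That identity is a handy sanity check on any frequency and severity table.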

Let's build this now! 👇


In [None]:
%sql
-- 📊 ACTUARIAL ANALYSIS: Loss Ratios by Specialty and State
-- This calculates incurred claims / exposure (using member count as a proxy for premium)

CREATE OR REPLACE TABLE payer_gold.loss_ratios_by_segment AS
SELECT 
    p.specialty,
    p.state,
    COUNT(DISTINCT c.claim_id) AS claim_count,
    COUNT(DISTINCT c.member_id) AS member_count,
    SUM(c.total_charge) AS total_incurred,
    ROUND(AVG(c.total_charge), 2) AS avg_claim_size,
    
    -- Loss Ratio Calculation (using member count as premium proxy)
    -- In real life, you'd join to a premium table!
    ROUND(SUM(c.total_charge) / COUNT(DISTINCT c.member_id), 2) AS loss_ratio_per_member,
    
    -- Claim Frequency (claims per member)
    ROUND(COUNT(c.claim_id) * 1.0 / COUNT(DISTINCT c.member_id), 2) AS frequency,
    
    -- Severity (avg cost per claim)
    ROUND(SUM(c.total_charge) / COUNT(c.claim_id), 2) AS severity
    
FROM payer_silver.claims c
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id
GROUP BY p.specialty, p.state
HAVING COUNT(c.claim_id) >= 5  -- Filter out small segments
ORDER BY total_incurred DESC;

-- Display results
SELECT * FROM payer_gold.loss_ratios_by_segment LIMIT 20;


### 🎯 YOUR TURN! Exercise 1: Specialty Analysis

**Task**: Modify the query above to answer: **"Which specialty has the highest severity (cost per claim)?"**

**Hints:**
1. You already have the `severity` column!
2. Just change the `ORDER BY` clause
3. Maybe add `LIMIT 10` to see top specialties

**Try it below:** (Click the cell and modify the SQL)

```sql
-- YOUR CODE HERE
-- Hint: Copy the query above and modify the ORDER BY


```

**Expected Output**: You should see specialties ranked by average claim cost.

---


## 📈 Actuarial Example 2: Claims Trending Analysis

### 🎯 Business Question
**"What are our monthly claim trends? Are claims trending up or down?"**

This is crucial for:
- **Rate making** (applying trend factors)
- **Reserving** (projecting ultimate losses)
- **Budgeting** (forecasting next year's costs)

### 📝 What We're Calculating

```
Month-over-Month Growth = (This Month - Last Month) / Last Month
Year-over-Year Growth = (This Month - Same Month Last Year) / Same Month Last Year
```
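
As a worked example of the month-over-month formula, the sketch below applies it to four invented monthly totals in plain Python, along with a 3-month moving average. Looking one position back in the list plays the role of a one-row lag; the year-over-year version would look 12 positions back instead.

```python
# Invented monthly incurred totals, ordered by month
monthly_incurred = [
    ("2024-01", 100_000.0),
    ("2024-02", 110_000.0),
    ("2024-03", 99_000.0),
    ("2024-04", 120_000.0),
]


def trend_rows(series):
    """Return (month, incurred, MoM growth %, 3-month moving average) per month."""
    rows = []
    for i, (month, incurred) in enumerate(series):
        prior = series[i - 1][1] if i >= 1 else None   # previous month, if any
        mom_pct = None if prior is None else round((incurred - prior) / prior * 100, 2)
        # Current month plus up to two preceding months
        window = [value for _, value in series[max(0, i - 2): i + 1]]
        moving_avg = round(sum(window) / len(window), 2)
        rows.append((month, incurred, mom_pct, moving_avg))
    return rows


trend = trend_rows(monthly_incurred)
for row in trend:
    print(row)
# ('2024-01', 100000.0, None, 100000.0)
# ('2024-02', 110000.0, 10.0, 105000.0)
# ('2024-03', 99000.0, -10.0, 103000.0)
# ('2024-04', 120000.0, 21.21, 109666.67)
```

The first month has no prior value, so its growth is left empty rather than reported as zero. The moving average for the first two months uses fewer than three values, which is worth remembering when reading the start of a trend line.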

### 🔧 SAS Equivalent
In SAS, you might use **PROC EXPAND** or **LAG functions** in a DATA step. 

In Databricks, we use **Window Functions** - specifically `LAG()` and `LEAD()`.

Let's build it! 👇


In [None]:
%sql
-- 📈 ACTUARIAL ANALYSIS: Monthly Claims Trending
-- Window functions for MoM and YoY calculations

CREATE OR REPLACE TABLE payer_gold.claims_trend_analysis AS
WITH monthly_claims AS (
    -- Step 1: Aggregate claims by month
    SELECT 
        DATE_TRUNC('MONTH', claim_date) AS claim_month,
        YEAR(claim_date) AS claim_year,
        MONTH(claim_date) AS claim_month_num,
        COUNT(*) AS claim_count,
        SUM(total_charge) AS total_incurred,
        ROUND(AVG(total_charge), 2) AS avg_claim_cost
    FROM payer_silver.claims
    GROUP BY claim_month, claim_year, claim_month_num
)
SELECT 
    claim_month,
    claim_count,
    total_incurred,
    avg_claim_cost,
    
    -- Month-over-Month Comparison
    LAG(total_incurred, 1) OVER (ORDER BY claim_month) AS prior_month_incurred,
    ROUND((total_incurred - LAG(total_incurred, 1) OVER (ORDER BY claim_month)) / 
          LAG(total_incurred, 1) OVER (ORDER BY claim_month) * 100, 2) AS mom_growth_pct,
    
    -- Year-over-Year Comparison (12 rows back; equals 12 months ago only if no months are missing)
    LAG(total_incurred, 12) OVER (ORDER BY claim_month) AS prior_year_incurred,
    ROUND((total_incurred - LAG(total_incurred, 12) OVER (ORDER BY claim_month)) / 
          LAG(total_incurred, 12) OVER (ORDER BY claim_month) * 100, 2) AS yoy_growth_pct,
    
    -- 3-Month Moving Average (for smoothing)
    ROUND(AVG(total_incurred) OVER (ORDER BY claim_month 
                                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_avg_3mo
    
FROM monthly_claims
ORDER BY claim_month;

-- Display the results
SELECT * FROM payer_gold.claims_trend_analysis;


## 🔺 Actuarial Example 3: Claims Development Triangle

### 🎯 Business Question
**"How do claims develop over time? What are our age-to-age factors?"**

This is **THE fundamental tool** for actuarial reserving!

### 📝 What We're Building

A **development triangle** shows:
- **Rows**: Accident/Policy Year
- **Columns**: Development Period (months since occurrence)
- **Values**: Cumulative claims at each development point

Then we calculate:
- **Age-to-Age Factors** (e.g., 12-to-24 month factor)
- **Ultimate Loss Projections**

### 🔧 Why This Matters
- **IBNR Reserves**: Estimate unreported claims
- **Case Reserve Adequacy**: Check if reserves are sufficient
- **Rate Adequacy**: Are our prices sufficient?

Let's build a simple development view! 👇


In [None]:
%sql
-- 🔺 ACTUARIAL ANALYSIS: Claims Development Pattern
-- Shows how claims emerge over time (by months since member enrollment, a stand-in for months since occurrence)

CREATE OR REPLACE TABLE payer_gold.claims_development AS
WITH claim_development AS (
    SELECT 
        c.claim_id,
        DATE_TRUNC('YEAR', c.claim_date) AS accident_year,
        c.claim_date,
        c.total_charge,
        m.effective_date AS member_effective_date,
        
        -- Development period in months (months between member effective date and claim date)
        -- This simulates months since policy inception
        DATEDIFF(MONTH, m.effective_date, c.claim_date) AS development_months,
        
        -- Group into development buckets (narrow at first, wider later)
        CASE 
            WHEN DATEDIFF(MONTH, m.effective_date, c.claim_date) <= 3 THEN '0-3 months'
            WHEN DATEDIFF(MONTH, m.effective_date, c.claim_date) <= 6 THEN '4-6 months'
            WHEN DATEDIFF(MONTH, m.effective_date, c.claim_date) <= 12 THEN '7-12 months'
            WHEN DATEDIFF(MONTH, m.effective_date, c.claim_date) <= 24 THEN '13-24 months'
            ELSE '25+ months'
        END AS development_period
        
    FROM payer_silver.claims c
    INNER JOIN payer_silver.members m ON c.member_id = m.member_id
    WHERE c.claim_date >= m.effective_date  -- Claims after member enrollment
)
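SELECT 
    accident_year,
    development_period,
    MIN(development_months) AS min_development_months,  -- numeric sort key for the text buckets
    COUNT(*) AS claim_count,
    SUM(total_charge) AS cumulative_incurred,  -- incurred within this bucket (not yet accumulated across buckets)
    ROUND(AVG(total_charge), 2) AS avg_claim_size,
    
    -- Calculate % of claims reported in each period
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY accident_year), 2) AS pct_of_total_claims
    
FROM claim_development
GROUP BY accident_year, development_period;

-- Display the triangle (sorted numerically; sorting on the text label would put '13-24 months' before '4-6 months')
SELECT * FROM payer_gold.claims_development
ORDER BY accident_year, min_development_months;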


## 🎯 YOUR TURN! Exercise 2: Development Analysis

**Task**: Calculate the age-to-age development factor from 0-3 months to 4-6 months.

**What You Need:**
1. Sum of cumulative incurred for "0-3 months"
2. Sum of cumulative incurred for "4-6 months"  
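3. Factor = (4-6 months) / (0-3 months)

**Note**: Each row of the table holds the amount incurred within its own bucket, so a strictly cumulative age-to-age factor would be (0-3 months + 4-6 months) / (0-3 months). Try both and compare.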

**Hint**: Use the table we just created!

```sql
-- YOUR CODE HERE
SELECT 
    '0-3 to 4-6 month factor' AS factor_name,
    -- Calculate the ratio here
FROM payer_gold.claims_development
WHERE development_period IN ('0-3 months', '4-6 months');
```

**Why This Matters**: Development factors help you project ultimate losses!

---


# 📖 SAS to Databricks Quick Reference for Actuaries

This section shows you **side-by-side comparisons** of common SAS code and Databricks equivalents.

---

## 1️⃣ Basic Data Aggregation

### SAS: PROC MEANS
```sas
PROC MEANS DATA=claims NOPRINT NWAY;  /* NWAY drops the overall-total row so the output matches GROUP BY */
    CLASS specialty;
    VAR total_charge;
    OUTPUT OUT=summary 
        N=claim_count 
        SUM=total_incurred 
        MEAN=avg_claim;
RUN;
```

### Databricks: GROUP BY
```sql
SELECT 
    specialty,
    COUNT(*) AS claim_count,
    SUM(total_charge) AS total_incurred,
    AVG(total_charge) AS avg_claim
FROM claims
GROUP BY specialty;
```

**🎯 Key Difference**: In Databricks, it's all in one SELECT statement!

---

## 2️⃣ Frequency Tables

### SAS: PROC FREQ
```sas
PROC FREQ DATA=claims;
    TABLES specialty * state / NOCOL NOROW;
RUN;
```

### Databricks: GROUP BY with COUNT
```sql
SELECT 
    specialty,
    state,
    COUNT(*) AS frequency
FROM claims
GROUP BY specialty, state
ORDER BY frequency DESC;
```

---

## 3️⃣ Conditional Logic

### SAS: DATA Step with IF-THEN
```sas
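DATA claims_categorized;
    SET claims;
    LENGTH risk_category $ 6;  /* without this, the first assignment sets length 3 and 'Medium' is truncated */
    IF total_charge < 1000 THEN risk_category = 'Low';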
    ELSE IF total_charge < 5000 THEN risk_category = 'Medium';
    ELSE risk_category = 'High';
RUN;
```

### Databricks: CASE WHEN
```sql
SELECT 
    *,
    CASE 
        WHEN total_charge < 1000 THEN 'Low'
        WHEN total_charge < 5000 THEN 'Medium'
        ELSE 'High'
    END AS risk_category
FROM claims;
```

---

## 4️⃣ Joining Tables

### SAS: PROC SQL Join
```sas
PROC SQL;
    CREATE TABLE enriched_claims AS
    SELECT c.*, p.specialty, p.provider_name
    FROM claims c
    LEFT JOIN providers p 
        ON c.provider_id = p.provider_id;
QUIT;
```

### Databricks: SQL Join (Identical!)
```sql
SELECT c.*, p.specialty, p.provider_name
FROM claims c
LEFT JOIN providers p 
    ON c.provider_id = p.provider_id;
```

**🎯 Great News**: If you know SAS PROC SQL, you already know Databricks SQL!

---

## 5️⃣ Lagging and Leading Values (Trending)

### SAS: LAG Function
```sas
DATA trends;
    SET monthly_data;
    prior_month = LAG(total_incurred);
    growth_pct = (total_incurred - prior_month) / prior_month * 100;
RUN;
```

### Databricks: LAG Window Function
```sql
SELECT 
    *,
    LAG(total_incurred, 1) OVER (ORDER BY month) AS prior_month,
    ROUND((total_incurred - LAG(total_incurred, 1) OVER (ORDER BY month)) / 
          LAG(total_incurred, 1) OVER (ORDER BY month) * 100, 2) AS growth_pct
FROM monthly_data;
```

---

## 6️⃣ Percentiles and Quantiles

### SAS: PROC UNIVARIATE
```sas
PROC UNIVARIATE DATA=claims;
    VAR total_charge;
    OUTPUT OUT=percentiles 
        PCTLPTS=25 50 75 90 95 99
        PCTLPRE=P;
RUN;
```

### Databricks: PERCENTILE_CONT
```sql
SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_charge) AS P25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_charge) AS P50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_charge) AS P75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY total_charge) AS P90,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_charge) AS P95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_charge) AS P99
FROM claims;
```
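
`PERCENTILE_CONT` interpolates linearly between neighboring values, which is worth seeing once by hand. The sketch below is plain Python with made-up charges, and `percentile_cont` here is a hypothetical helper written for illustration, not the Databricks function itself. SAS's default percentile definition (`PCTLDEF=5`) uses a different rule, so small differences between the two tools are expected on the same data.

```python
import math

def percentile_cont(values, fraction):
    """Linear-interpolation percentile over the sorted values, for 0 <= fraction <= 1."""
    ordered = sorted(values)
    if not ordered:
        return None
    position = fraction * (len(ordered) - 1)  # zero-based fractional row position
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Move from the lower neighbor toward the upper neighbor by the fractional part.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Hypothetical claim charges (made-up numbers).
charges = [100.0, 200.0, 300.0, 400.0, 1000.0]

p50 = percentile_cont(charges, 0.50)  # falls exactly on the middle value
p90 = percentile_cont(charges, 0.90)  # falls between 400 and 1000

print(p50, round(p90, 2))
```

The 90th percentile lands between the two largest charges, so a single large claim pulls it up noticeably in a small sample.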

---

## 7️⃣ Moving Averages (Smoothing)

### SAS: Rolling Average
```sas
DATA moving_avg;
    SET monthly_data;
    avg_3mo = MEAN(total_incurred, LAG(total_incurred), LAG2(total_incurred));
RUN;
```

### Databricks: Window Function with ROWS
```sql
SELECT 
    *,
    AVG(total_incurred) OVER (
        ORDER BY month 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS avg_3mo
FROM monthly_data;
```
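
One detail is easy to miss: for the first two rows, `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` averages over only the rows that exist, much as the SAS `MEAN` function skips missing `LAG` values. A minimal plain-Python sketch of that behavior follows; the numbers are made up and `trailing_average` is a hypothetical helper name.

```python
def trailing_average(values, window=3):
    """Average of each value and up to window - 1 values before it."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # the window shrinks at the start of the series
        chunk = values[start:i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

# Hypothetical monthly incurred totals (made-up numbers), already in month order.
monthly_incurred = [90.0, 110.0, 100.0, 130.0]

averages = trailing_average(monthly_incurred)
print([round(a, 2) for a in averages])
```

The first value is an average of one month and the second of two months, so early points in a smoothed series are less stable than later ones.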

---

## 🎓 Quick Translation Guide

| **SAS** | **Databricks** | **Notes** |
|---------|----------------|-----------|
| `PROC SQL` | SQL queries | Almost identical! |
| `PROC MEANS` | `GROUP BY` + aggregations | Very similar |
| `PROC FREQ` | `GROUP BY` + `COUNT()` | Same logic |
| `DATA` step | `SELECT` with transformations | Different syntax, same result |
| `LAG()` | `LAG() OVER (ORDER BY)` | Window function needed |
| `RETAIN` | Window functions | Use cumulative sums |
| `MERGE` | `JOIN` | SQL joins |
| `WHERE` | `WHERE` | Identical! |
| `IF-THEN-ELSE` | `CASE WHEN` | Different syntax |
| Macros | Widgets + parameters | Similar concept |

---

## 💡 Pro Tips for SAS Users

1. **PROC SQL knowledge transfers 90%**: If you're comfortable with SAS PROC SQL, you'll pick up Databricks quickly!

2. **Window functions = LAG/LEAD on steroids**: More powerful than SAS LAG functions.

3. **No DATA step needed**: Most transformations can be done in SQL with `CASE WHEN`.

4. **CTEs are your friend**: Use `WITH` clauses instead of creating intermediate datasets.

5. **Display > PROC PRINT**: Just use `display()` in Python cells or `SELECT` in SQL.

---


# 🎯 Interactive Exercises - Test Your Skills!

Now it's time to practice! These exercises cover common actuarial analyses.

---

## Exercise 1: Calculate Loss Ratios by Plan Type

**Objective**: Find which plan has the highest incurred claims per member. A true loss ratio divides incurred claims by premium; this query uses incurred per member as a stand-in.

**What to Calculate**:
- Claims count by plan_id
- Total incurred by plan_id
- Incurred per member by plan_id
- Sort by highest incurred per member first

**Starter Code**:
```sql
-- YOUR TURN! Complete this query
SELECT 
    m.plan_id,
    COUNT(c.claim_id) AS _____,
    SUM(c.total_charge) AS _____,
    COUNT(DISTINCT m.member_id) AS member_count,
    ROUND(SUM(c.total_charge) / COUNT(DISTINCT m.member_id), 2) AS loss_ratio_per_member
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
GROUP BY _____
ORDER BY _____ DESC;
```

**Hints**:
- Fill in the blanks (_____)
- Think about what metrics you need
- How should you order the results?
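
One thing to watch in the starter query: it uses an `INNER JOIN` from claims, so members with no claims never reach `COUNT(DISTINCT m.member_id)`. The denominator is therefore claimants, not all enrolled members. The plain-Python sketch below shows how much that choice can matter. The data is made up and `incurred_per_member` is a hypothetical helper name.

```python
# Hypothetical data (made-up values): member_id -> plan_id, and (member_id, total_charge) claims.
members = {"M1": "PPO", "M2": "PPO", "M3": "HMO"}
claims = [("M1", 300.0), ("M1", 100.0), ("M3", 200.0)]

def incurred_per_member(members, claims, count_all_members):
    """Total incurred per plan divided by either all enrolled members or claimants only."""
    incurred = {}
    claimants = {}
    for member_id, charge in claims:
        plan = members[member_id]
        incurred[plan] = incurred.get(plan, 0.0) + charge
        claimants.setdefault(plan, set()).add(member_id)

    enrolled = {}
    for member_id, plan in members.items():
        enrolled.setdefault(plan, set()).add(member_id)

    result = {}
    for plan, total in incurred.items():
        base = enrolled[plan] if count_all_members else claimants[plan]
        result[plan] = round(total / len(base), 2)
    return result

claimants_only = incurred_per_member(members, claims, count_all_members=False)
all_enrolled = incurred_per_member(members, claims, count_all_members=True)

print(claimants_only)
print(all_enrolled)
```

Member M2 has no claims, so the PPO figure halves once all enrolled members are counted. In SQL, starting from the members table with a `LEFT JOIN` to claims gives the all-enrolled version.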

---

## Exercise 2: Identify High-Risk Members

**Objective**: Find members with unusually high claims (potential high-risk individuals).

**Criteria**:
- Members whose total incurred exceeds the 95th percentile of member-level totals
- Include member demographics
- Calculate their total incurred and claim count

**Starter Code**:
```sql
-- Step 1: Total incurred and claim count per member
WITH member_totals AS (
    SELECT 
        member_id,
        COUNT(*) AS claim_count,
        SUM(total_charge) AS total_incurred
    FROM payer_silver.claims
    GROUP BY member_id
),
-- Step 2: Find the 95th percentile of the member totals
percentile_threshold AS (
    SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_incurred) AS p95
    FROM member_totals
),
-- Step 3: Keep only members above the threshold
high_cost_members AS (
    SELECT t.member_id, t.claim_count, t.total_incurred
    FROM member_totals t
    CROSS JOIN percentile_threshold p
    WHERE t.total_incurred > p.p95
)
-- Step 4: YOUR TURN - Join with member details and display results
SELECT 
    -- Add your columns here
FROM high_cost_members h
INNER JOIN payer_silver.members m ON h.member_id = m.member_id
ORDER BY total_incurred DESC;
```

**What This Tells You**: These are your high-risk members who need care management!

---

## Exercise 3: Month-over-Month Growth Rate

**Objective**: Calculate which month had the highest growth in claims.

**What to Find**:
- Monthly total incurred
- Prior month incurred (use LAG)
- Month-over-month growth %
- Which month had highest growth?

**Starter Code**:
```sql
WITH monthly_totals AS (
    SELECT 
        DATE_TRUNC('MONTH', claim_date) AS claim_month,
        SUM(total_charge) AS monthly_incurred
    FROM payer_silver.claims
    GROUP BY claim_month
)
SELECT 
    claim_month,
    monthly_incurred,
    -- YOUR TURN: Add LAG function to get prior month
    LAG(monthly_incurred, ___) OVER (ORDER BY ___) AS prior_month,
    -- YOUR TURN: Calculate growth percentage
    ROUND((monthly_incurred - LAG(_____, ___) OVER (ORDER BY _____)) / 
          LAG(_____, ___) OVER (ORDER BY _____) * 100, 2) AS mom_growth_pct
FROM monthly_totals
ORDER BY mom_growth_pct DESC  -- Highest growth first
LIMIT 10;
```

**Actuarial Insight**: Sudden spikes might indicate unusual events (epidemics, policy changes, fraud).

---

## Exercise 4: Provider Network Analysis

**Objective**: Which providers/specialties drive the most costs?

**Analysis Steps**:
1. Group by provider specialty
2. Calculate total incurred, claim count, avg claim size
3. Calculate % of total costs
4. Identify top cost drivers

**Starter Code**:
```sql
WITH specialty_costs AS (
    SELECT 
        p.specialty,
        COUNT(c.claim_id) AS claim_count,
        SUM(c.total_charge) AS total_incurred,
        ROUND(AVG(c.total_charge), 2) AS avg_claim_cost
    FROM payer_silver.claims c
    INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id
    GROUP BY p.specialty
)
SELECT 
    specialty,
    claim_count,
    total_incurred,
    avg_claim_cost,
    -- YOUR TURN: Calculate % of total costs
    ROUND(total_incurred * 100.0 / SUM(total_incurred) OVER (), 2) AS pct_of_total_cost
FROM specialty_costs
ORDER BY total_incurred DESC;
```

**Why This Matters**: Understanding cost drivers helps with:
- Network contracting
- Provider steerage
- Benefit design
- Rate setting

---

## 🏆 Challenge Exercise: Build Your Own Analysis!

Pick one of these actuarial questions and build the SQL to answer it:

1. **Geographic Analysis**: Which states have the best/worst loss ratios?
2. **Gender Analysis**: Do claims patterns differ by gender?
3. **Age Analysis**: How do claims trend with member age? (Use birth_date to calculate age)
4. **Seasonality**: Are there seasonal patterns in claims? (Monthly analysis)
5. **Large Claims**: Identify claims above $10,000 - what specialties are they from?

**Your Code Here**:
```sql
-- Pick a question and write your analysis!


```

---


# 🎓 Workshop Summary - You Did It!

## 🎉 Congratulations, Actuaries!

You've just completed your first Databricks workshop! Let's review what you learned.

---

## ✅ What You Accomplished Today

### 1. **Transitioned from SAS to Databricks** 🔄
- Learned how your SAS knowledge transfers
- Ran SQL queries (very similar to PROC SQL!)
- Used window functions (LAG, LEAD for trending)

### 2. **Built Actuarial Analytics** 📊
- ✅ **Loss Ratios** by segment
- ✅ **Claims Trending** (month-over-month growth)
- ✅ **Development Patterns** (claims emergence)
- ✅ **Risk Segmentation** (high-cost members)
- ✅ **Provider Analysis** (cost drivers)

### 3. **Learned Key SQL Techniques** 💻
- `GROUP BY` for aggregations (like PROC MEANS)
- `JOIN` for combining tables
- `CASE WHEN` for conditional logic (like IF-THEN)
- `LAG/LEAD` for trending (like SAS LAG functions)
- `PERCENTILE_CONT` for quantiles (like PROC UNIVARIATE)
- Window functions (`OVER` clauses) for running calculations

---

## 🚀 How to Use This at Work

### Immediate Applications:

1. **Quarterly Loss Ratio Reports**
   - Use the loss ratio queries we built
   - Group by state, specialty, plan type
   - Export to dashboards (no more Excel!)

2. **Monthly Trending Analysis**
   - Monitor claims frequency and severity trends
   - Identify unusual spikes early
   - Feed into your pricing models

3. **Reserving Support**
   - Build development triangles
   - Calculate age-to-age factors
   - Track IBNR emergence patterns

4. **Risk Management**
   - Identify high-risk members
   - Segment populations for care management
   - Calculate risk scores

5. **Rate Filings**
   - Trend historical claims
   - Support rate change justifications
   - Build exhibits for regulators

---

## 📚 Next Steps for Actuaries

### Week 1: Practice the Basics
- [ ] Recreate one of your SAS reports in Databricks
- [ ] Try writing 5 simple SQL queries
- [ ] Calculate a loss ratio for your own data
- [ ] Create one visualization

### Week 2-4: Build Real Analyses
- [ ] Build a complete loss ratio analysis
- [ ] Create a development triangle
- [ ] Set up monthly trend monitoring
- [ ] Share a dashboard with your team

### Month 2-3: Advanced Topics
- [ ] Learn PySpark (for complex calculations)
- [ ] Automate your reports (scheduled notebooks)
- [ ] Build predictive models (ML)
- [ ] Create interactive dashboards (Databricks SQL)

---

## 🆘 Getting Help

### When You're Stuck:

1. **Use the AI Assistant** 🤖
   - Click the AI icon in any cell
   - Ask: "How do I calculate a loss ratio?"
   - Ask: "Convert this SAS code to SQL"

2. **Databricks Documentation**
   - [SQL Reference](https://docs.databricks.com/sql/language-manual/)
   - [Window Functions](https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html)
   - [Date Functions](https://docs.databricks.com/sql/language-manual/sql-ref-datetime-functions.html)

3. **Community Resources**
   - [Databricks Community Forums](https://community.databricks.com/)
   - [Stack Overflow - Databricks Tag](https://stackoverflow.com/questions/tagged/databricks)
   - Internal company Databricks experts

4. **Your Colleagues**
   - Share this notebook with other actuaries!
   - Form a study group
   - Practice together

---

## 💡 Key Takeaways

### 1. **You Already Know More Than You Think!**
If you know SAS PROC SQL, you know 90% of Databricks SQL. The syntax is almost identical!

### 2. **Start Simple**
Don't try to learn everything at once. Start with:
- Basic `SELECT` queries
- Simple aggregations (`GROUP BY`)
- Joins

Then gradually add:
- Window functions
- CTEs (WITH clauses)
- Advanced analytics

### 3. **SQL is Enough for Most Actuarial Work**
You don't need to learn Python/PySpark right away. Most actuarial analyses can be done with SQL alone!

### 4. **Iterate and Improve**
Your first queries won't be perfect. That's okay! 
- Start with something that works
- Refine it over time
- Ask for feedback

---

## 🎯 Your Action Plan

**This Week:**
1. Open Databricks and run your first query
2. Try recreating ONE simple SAS report
3. Share your success with a colleague!

**This Month:**
1. Build one complete analysis (loss ratios or trends)
2. Create one visualization
3. Teach someone else what you learned

**This Quarter:**
1. Migrate one regular report to Databricks
2. Learn one new technique (window functions, CTEs, etc.)
3. Explore PySpark basics (optional)

---

## 📖 Recommended Learning Resources

### For Actuaries:
1. **Databricks SQL Guide** (focus on aggregations and window functions)
2. **Healthcare Analytics Examples** (look for payer-specific use cases)
3. **SAS to SQL Translation Guides** (many available online)

### Courses:
1. **Databricks Fundamentals** (free online course)
2. **SQL for Data Analysis** (many free resources)
3. **Healthcare Analytics on Databricks** (industry-specific)

### Books:
1. "SQL for Data Analytics" (O'Reilly)
2. "Databricks Certified Data Analyst Study Guide"

---

## 🙏 Thank You!

Thank you for participating in this workshop! Remember:
- **Be patient with yourself** - Learning new tools takes time
- **Practice regularly** - Try to use Databricks weekly
- **Ask questions** - No question is too simple
- **Help others** - Teaching reinforces your own learning

---

## 🎁 Bonus: Cheat Sheet for Your Desk

```
COMMON ACTUARIAL QUERIES:

1. Loss Ratio:
   SELECT segment, SUM(claims)/SUM(premium) 
   FROM data GROUP BY segment;

2. Trending:
   SELECT month, value, LAG(value) OVER (ORDER BY month)
   FROM data;

3. Percentiles:
   SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY claim)
   FROM claims;

4. Moving Average:
   SELECT date, AVG(value) OVER (ORDER BY date 
          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
   FROM data;

5. Development Factor:
   SELECT period, SUM(claims), 
          SUM(claims)/LAG(SUM(claims)) OVER (ORDER BY period)
   FROM triangle GROUP BY period;
```

---

## 📞 Stay Connected

- **Questions?** Ask your workshop instructor or Databricks team
- **Want More?** Let us know what topics you'd like to see
- **Success Stories?** Share how you're using Databricks!

---

**Happy Analyzing!** 🚀📊🎯

*Remember: Every expert was once a beginner. You've got this!*

---


## Step 1: Create Enriched Tables with SQL

In [0]:
%sql
-- Create gold schema
CREATE SCHEMA IF NOT EXISTS payer_gold;

-- Gold: Claims with member and provider details
CREATE OR REPLACE TABLE payer_gold.claims_enriched AS
SELECT
  c.claim_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  m.member_id,
  m.first_name,
  m.last_name,
  m.gender,
  m.plan_id,
  p.provider_id,
  p.provider_name,
  p.specialty,
  p.city,
  p.state
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id;


-- Gold: Claim Aggregates per Member
CREATE OR REPLACE TABLE payer_gold.member_claim_summary AS
SELECT
  member_id,
  COUNT(DISTINCT claim_id) AS total_claims,
  SUM(total_charge) AS sum_claims,
  MAX(total_charge) AS max_claim,
  MIN(total_charge) AS min_claim
FROM payer_silver.claims
GROUP BY member_id;


## Step 2: Create Advanced Analytics with PySpark

Let's create more sophisticated analytics tables using PySpark:

### Example 1: Provider Performance Dashboard

In [0]:
from pyspark.sql.functions import col, sum, avg, count, min, max, stddev, countDistinct
from pyspark.sql.window import Window

# Load tables
claims = spark.table("payer_silver.claims")
providers = spark.table("payer_silver.providers")
members = spark.table("payer_silver.members")

# Create provider performance metrics
provider_performance = claims.join(providers, "provider_id") \
    .groupBy(
        "provider_id", 
        "provider_name", 
        "specialty", 
        "city", 
        "state"
    ) \
    .agg(
        count("claim_id").alias("total_claims"),
        countDistinct("member_id").alias("unique_patients"),
        sum("total_charge").alias("total_revenue"),
        avg("total_charge").alias("avg_claim_amount"),
        min("total_charge").alias("min_claim_amount"),
        max("total_charge").alias("max_claim_amount"),
        stddev("total_charge").alias("stddev_claim_amount")
    ) \
    .orderBy(col("total_revenue").desc())

print("Top 10 Providers by Revenue:")
display(provider_performance.limit(10))

# Save to Gold table
provider_performance.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.provider_performance")


### Example 2: Time-Series Analysis - Monthly Claims Trends


In [0]:
from pyspark.sql.functions import year, month, date_format, lag, round as spark_round
from pyspark.sql.window import Window

# Create monthly trends
monthly_claims = claims \
    .withColumn("year", year("claim_date")) \
    .withColumn("month", month("claim_date")) \
    .withColumn("month_year", date_format("claim_date", "yyyy-MM")) \
    .groupBy("year", "month", "month_year", "claim_status") \
    .agg(
        count("claim_id").alias("claim_count"),
        sum("total_charge").alias("total_charges"),
        avg("total_charge").alias("avg_charge")
    ) \
    .orderBy("year", "month", "claim_status")

# Calculate month-over-month growth
window_spec = Window.partitionBy("claim_status").orderBy("year", "month")

monthly_trends = monthly_claims \
    .withColumn("prev_month_charges", lag("total_charges").over(window_spec)) \
    .withColumn(
        "mom_growth_pct", 
        spark_round(
            ((col("total_charges") - col("prev_month_charges")) / col("prev_month_charges") * 100), 
            2
        )
    )

print("Monthly Claims Trends with Growth:")
display(monthly_trends)

# Save to Gold
monthly_trends.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.monthly_claims_trends")


# 🤖 Using Databricks AI Assistant

---

Databricks AI Assistant can help you write code, understand data, and troubleshoot issues!

### How to Use AI Assistant:
1. Click the AI Assistant icon
2. Ask questions in natural language
3. Get code suggestions and explanations

### Example Prompts to Try:
- "What kind of aggregations can I do with table payer_gold.claims_enriched?"
- "How do I calculate the total claims by specialty?"
- "Show me how to create a window function for running totals"
- "What does spark.table() command do?"
- "Help me debug this PySpark error"

---



# 📊 Analytics & Visualization

---

## Introduction to Databricks Visualizations

Databricks provides powerful built-in visualization capabilities. You can create:
- 📊 Bar charts, line charts, scatter plots
- 🗺️ Geographic maps
- 📈 Histograms and box plots
- 🥧 Pie charts and area charts

> **💡 Tip**: Use `display()` function to automatically generate interactive visualizations!

---

## Statistical Analysis with PySpark

Let's explore our data using statistical analysis and visualizations:

In [0]:
display(spark.table("payer_gold.claims_enriched"))

## Example Visualizations


### 1. Claims by Status (Bar Chart)

In [0]:
from pyspark.sql.functions import sum, count, avg

# Aggregate total charges by claim status
claims_by_status = spark.table("payer_gold.claims_enriched") \
    .groupBy("claim_status") \
    .agg(
        sum("total_charge").alias("total_charges"),
        count("claim_id").alias("claim_count"),
        avg("total_charge").alias("avg_charge")
    ) \
    .orderBy("total_charges", ascending=False)

# Visualize as a bar chart: X-axis = claim_status, Y-axis = total_charges
display(claims_by_status)

# Tip: In the visualization options, select "Bar Chart", set X-axis to 'claim_status', Y-axis to 'total_charges'

Databricks visualization. Run in Databricks to view.

### 2. Gender Distribution (Pie Chart)


In [0]:
# Analyze claims distribution by gender
gender_analysis = spark.table("payer_gold.claims_enriched") \
    .groupBy("gender") \
    .agg(
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges"),
        avg("total_charge").alias("avg_charge_per_claim")
    )

print("👥 Claims Analysis by Gender:")
display(gender_analysis)

# Tip: Try changing the visualization to a Pie Chart to see the distribution!

Databricks visualization. Run in Databricks to view.

### 3. Time Series - Claims Over Time (Line Chart)


In [0]:
# Time series analysis of claims
time_series = spark.table("payer_gold.claims_enriched") \
    .groupBy("claim_date") \
    .agg(
        sum("total_charge").alias("daily_charges"),
        count("claim_id").alias("daily_claim_count")
    ) \
    .orderBy("claim_date")

print("📈 Daily Claims Trends:")
display(time_series)

# Tip: Select Line Chart visualization with claim_date as X-axis and daily_charges as Y-axis!

Databricks visualization. Run in Databricks to view.

### 4. Geographic Analysis - Claims by City (Map/Bar Chart)


In [0]:
# Geographic distribution of claims
city_analysis = spark.table("payer_gold.claims_enriched") \
    .groupBy("city", "state") \
    .agg(
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges"),
        countDistinct("member_id").alias("unique_members")
    ) \
    .orderBy("total_charges", ascending=False)

print("🗺️ Claims Distribution by City:")
display(city_analysis.limit(20))

# Tip: Try Map visualization if your data has proper geographic fields!

Databricks visualization. Run in Databricks to view.


### 5. Distribution Analysis - Charge Amount Histogram


In [0]:
# Analyze the distribution of claim charges
charges = spark.table("payer_gold.claims_enriched").select("total_charge")

# Get statistical summary
print("📊 Statistical Summary of Claim Charges:")
charges.describe().show()

# Display histogram
print("\n💵 Charge Distribution (Histogram):")
display(charges)

# Tip: Select Histogram visualization to see the distribution of charges!
# You can adjust the number of bins for better granularity.

Databricks visualization. Run in Databricks to view.

### 6. Correlation Analysis - Scatter Plot


In [0]:
# Scatter plot to find relationships between variables
scatter_data = spark.table("payer_gold.claims_enriched") \
    .select("claim_date", "total_charge", "claim_status")

print("📍 Scatter Plot: Claim Date vs Charge Amount")
display(scatter_data)

# Tip: Select Scatter Plot visualization
# X-axis: claim_date, Y-axis: total_charge
# Group by: claim_status for color coding

Databricks visualization. Run in Databricks to view.



## 🎯 Advanced Analytics Example: Cohort Analysis

Let's create a more complex analysis - member cohort analysis based on their enrollment date:


In [0]:
from pyspark.sql.functions import datediff, months_between, floor

# Join members with their claims
member_claims = spark.table("payer_silver.members") \
    .join(spark.table("payer_silver.claims"), "member_id", "left")

# Create cohorts based on enrollment month
cohort_analysis = member_claims \
    .withColumn("enrollment_month", date_format("effective_date", "yyyy-MM")) \
    .withColumn("claim_month", date_format("claim_date", "yyyy-MM")) \
    .withColumn(
        "months_since_enrollment", 
        floor(months_between("claim_date", "effective_date"))
    ) \
    .groupBy("enrollment_month", "months_since_enrollment") \
    .agg(
        countDistinct("member_id").alias("active_members"),
        count("claim_id").alias("total_claims"),
        sum("total_charge").alias("total_charges")
    ) \
    .filter(col("months_since_enrollment").isNotNull()) \
    .orderBy("enrollment_month", "months_since_enrollment")

print("📅 Member Cohort Analysis:")
display(cohort_analysis)

# Save as a Gold table for future analysis
cohort_analysis.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("payer_gold.member_cohort_analysis")


# AI/BI

Intelligent analytics for everyone!

Databricks AI/BI is a new type of business intelligence product designed to provide a deep understanding of your data's semantics, enabling self-service data analysis for everyone in your organization. AI/BI is built on a compound AI system that draws insights from the full lifecycle of your data across the Databricks platform, including ETL pipelines, lineage, and other queries.

<img src="https://www.databricks.com/sites/default/files/2025-05/hero-image-ai-bi-v2-2x.png?v=1748417271" alt="Databricks AI/BI" width="600" height="500">

# Genie

Talk with your data

Now everyone can get insights from data simply by asking questions in natural language.

<img src="https://www.databricks.com/sites/default/files/2025-06/ai-bi-genie-hero.png?v=1749162682" alt="AI/BI Genie" width="600" height="500">


# 📚 Best Practices & Performance Tips

## 🚀 Performance Optimization

### 1. **Use Partitioning for Large Tables**
```python
# Partition by date for time-series data
df.write \
    .format("delta") \
    .partitionBy("claim_date") \
    .saveAsTable("payer_gold.claims_partitioned")
```

### 2. **Enable Z-Ordering for Common Filters**
```sql
OPTIMIZE payer_gold.claims_enriched
ZORDER BY (member_id, claim_date);
```

### 3. **Use Caching for Frequently Accessed DataFrames**
```python
claims_df = spark.table("payer_silver.claims").cache()
# cache() is lazy: the first action stores the data, and later actions reuse it without re-reading
```

### 4. **Broadcast Small Tables in Joins**
```python
from pyspark.sql.functions import broadcast

large_df.join(broadcast(small_df), "key")
```

---

## 🔒 Data Quality Best Practices

### 1. **Always Validate Data**
```python
# Add constraints
spark.sql("""
    ALTER TABLE payer_silver.claims 
    ADD CONSTRAINT valid_charge CHECK (total_charge > 0)
""")
```

### 2. **Use Schema Evolution Carefully**
```python
# Explicitly define schema for production
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("claim_id", StringType(), False),
    StructField("total_charge", DoubleType(), True),
    # ... more fields
])
```

### 3. **Implement Data Quality Checks**
```python
from pyspark.sql.functions import col, current_date

def validate_claims(df):
    """Run data quality checks"""
    checks = {
        "null_claim_ids": df.filter(col("claim_id").isNull()).count(),
        "negative_charges": df.filter(col("total_charge") < 0).count(),
        "future_dates": df.filter(col("claim_date") > current_date()).count()
    }
    return checks
```

---

## 💾 Delta Lake Best Practices

### 1. **Regular Maintenance**
```sql
-- Compact small files
OPTIMIZE payer_gold.claims_enriched;

-- Remove old versions (keep 7 days)
VACUUM payer_gold.claims_enriched RETAIN 168 HOURS;

-- Update statistics
ANALYZE TABLE payer_gold.claims_enriched COMPUTE STATISTICS;
```

### 2. **Use Time Travel for Auditing**
```sql
-- Query previous version
SELECT * FROM payer_gold.claims_enriched VERSION AS OF 1;

-- Query as of timestamp
SELECT * FROM payer_gold.claims_enriched TIMESTAMP AS OF '2025-01-01';
```

### 3. **Enable Change Data Feed**
```sql
ALTER TABLE payer_gold.claims_enriched 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

---

## 🏗️ Architecture Best Practices

### 1. **Medallion Layer Guidelines**
- **Bronze**: Keep all source data, minimal transformation
- **Silver**: One source system = one silver table (usually)
- **Gold**: Many silver tables → one gold table (join/aggregate)
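
The three layers can be illustrated with a toy example in plain Python (no Spark needed). The records and the `to_silver` / `to_gold` helpers below are invented for this sketch. In the workshop itself these steps are done with SQL and PySpark against Delta tables.

```python
# Bronze: raw records kept as received (strings, duplicates, bad rows included)
bronze_claims = [
    {"claim_id": "C1", "member_id": "M1", "total_charge": "100.50"},
    {"claim_id": "C1", "member_id": "M1", "total_charge": "100.50"},  # duplicate
    {"claim_id": "C2", "member_id": "M1", "total_charge": "49.50"},
    {"claim_id": "C3", "member_id": "M2", "total_charge": "not_a_number"},
    {"claim_id": "C4", "member_id": "M2", "total_charge": "200"},
]

def to_silver(rows):
    """Clean one source: drop duplicate claim IDs and unparseable charges."""
    seen, cleaned = set(), []
    for row in rows:
        if row["claim_id"] in seen:
            continue  # skip duplicate claim
        try:
            charge = float(row["total_charge"])
        except ValueError:
            continue  # skip rows whose charge is not numeric
        seen.add(row["claim_id"])
        cleaned.append({**row, "total_charge": charge})
    return cleaned

def to_gold(rows):
    """Aggregate cleaned claims into total charges per member."""
    totals = {}
    for row in rows:
        totals[row["member_id"]] = totals.get(row["member_id"], 0.0) + row["total_charge"]
    return totals

silver_claims = to_silver(bronze_claims)
gold_member_totals = to_gold(silver_claims)
print(gold_member_totals)  # {'M1': 150.0, 'M2': 200.0}
```

Note that bronze is never modified: the duplicate and the bad row are still there if you ever need to audit what the source system sent.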

### 2. **Naming Conventions**
```
Catalog: <organization>_<environment>
Schema: <domain>_<layer>
Table: <entity>_<descriptor>

Examples:
- acme_prod.payer_bronze.claims_raw
- acme_dev.payer_silver.claims_cleaned
- acme_prod.payer_gold.member_360_view
```
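
If you create many tables from code, a small helper can keep names consistent with this convention. The following is a minimal sketch: `qualified_table_name` and `VALID_LAYERS` are hypothetical names introduced here, not a Databricks API.

```python
VALID_LAYERS = ("bronze", "silver", "gold")

def qualified_table_name(organization, environment, domain, layer, table):
    """Build <organization>_<environment>.<domain>_<layer>.<table>."""
    if layer not in VALID_LAYERS:
        raise ValueError(f"Unknown layer {layer!r}; expected one of {VALID_LAYERS}")
    catalog = f"{organization}_{environment}"
    schema = f"{domain}_{layer}"
    return f"{catalog}.{schema}.{table}"

name = qualified_table_name("acme", "prod", "payer", "gold", "member_360_view")
print(name)  # acme_prod.payer_gold.member_360_view
```

Rejecting unknown layers up front catches typos such as `"glod"` before a table is created in the wrong schema.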

### 3. **Documentation**
```sql
-- Add table comments
COMMENT ON TABLE payer_gold.claims_enriched IS 
'Enriched claims with member and provider details for analytics';

-- Add column comments
ALTER TABLE payer_gold.claims_enriched 
ALTER COLUMN total_charge COMMENT 'Total charged amount in USD';
```

---


# 🎓 Workshop Summary & Next Steps

## 🎉 Congratulations!

You've completed the Databricks Healthcare Payer Analytics Workshop! Let's review what you learned:

---

## ✅ What You Accomplished

### 1. **Medallion Architecture**
- ✅ Built a complete **Bronze → Silver → Gold** pipeline
- ✅ Understood data quality improvement at each layer
- ✅ Created analytics-ready datasets

### 2. **Data Engineering Skills**
- ✅ Loaded data using **COPY INTO**
- ✅ Transformed data with **SQL and PySpark**
- ✅ Applied data quality checks and validations
- ✅ Created aggregations and derived metrics

### 3. **Analytics & Visualization**
- ✅ Generated business insights from data
- ✅ Created interactive visualizations
- ✅ Performed statistical analysis
- ✅ Built executive dashboards

### 4. **Databricks Platform**
- ✅ Worked with **Unity Catalog**
- ✅ Used **Delta Lake** for reliable data storage
- ✅ Leveraged **AI Assistant** for code help
- ✅ Applied performance optimization techniques

---

## 🚀 Next Steps

### Immediate Actions
1. ⭐ **Bookmark this notebook** for future reference
2. 📖 Complete the hands-on exercises
3. 🔄 Try modifying the code with your own logic
4. 💾 Export your results and share with your team

### Continue Learning

#### 📚 Advanced Topics to Explore
- **Delta Live Tables (DLT)**: Declarative pipeline framework
- **Databricks Workflows**: Orchestration and scheduling
- **Unity Catalog**: Advanced governance features
- **Databricks SQL**: Performance tuning and optimization
- **Machine Learning**: MLflow and Feature Store
- **Streaming**: Structured Streaming with Delta Lake

#### 🔗 Helpful Resources
- [Databricks Documentation](https://docs.databricks.com/)
- [Delta Lake Guide](https://docs.delta.io/)
- [Databricks Academy](https://www.databricks.com/learn/training)
- [Community Forums](https://community.databricks.com/)
- [Databricks Blog](https://www.databricks.com/blog)

#### 🎯 Recommended Certifications
- **Databricks Lakehouse Platform Fundamentals**
- **Databricks Certified Data Engineer Associate**
- **Databricks Certified Data Analyst Associate**

---

## 💡 Real-World Applications

Apply these skills to:
- 🏥 **Healthcare**: Claims processing, patient analytics, risk scoring
- 🏦 **Finance**: Fraud detection, transaction analysis, risk management
- 🛒 **Retail**: Customer analytics, inventory optimization, sales forecasting
- 📱 **Technology**: User behavior analysis, product metrics, A/B testing

---

## 🤝 Get Help & Share

### Need Help?
- 💬 Ask the **Databricks AI Assistant**
- 🌐 Visit [Databricks Community](https://community.databricks.com/)
- 📧 Contact your Databricks account team
- 📖 Check [Stack Overflow](https://stackoverflow.com/questions/tagged/databricks)

### Share Your Success
- ⭐ Share insights with your colleagues
- 📊 Create dashboards for stakeholders
- 🎤 Present your work at team meetings
- 🏆 Contribute to the Databricks community

---

## 📝 Feedback

We'd love to hear your thoughts on this workshop!

**What worked well?** **What could be improved?** **What topics do you want to learn next?**

---

## 🙏 Thank You!

Thank you for participating in this workshop. We hope you found it valuable and are excited to continue your Databricks journey! 🚀

---

*Last Updated: October 7, 2025*


# 📖 Quick Reference Guide

## Common PySpark Operations

### Reading Data
```python
# From Delta table
df = spark.table("catalog.schema.table")

# From CSV
df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")

# From JSON
df = spark.read.json("path/to/file.json")

# From Parquet
df = spark.read.parquet("path/to/file.parquet")
```

### Writing Data
```python
# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("table_name")

# Append mode
df.write.format("delta").mode("append").saveAsTable("table_name")

# With partitioning
df.write.format("delta").partitionBy("date_col").saveAsTable("table_name")
```

### Common Transformations
```python
# Note: this wildcard import shadows Python built-ins such as sum, round, and abs
from pyspark.sql.functions import *

# Select columns
df.select("col1", "col2")

# Filter rows
df.filter(col("amount") > 100)
df.where("amount > 100")

# Add new column
df.withColumn("new_col", col("old_col") * 2)

# Rename column
df.withColumnRenamed("old_name", "new_name")

# Drop column
df.drop("col_name")

# Group by and aggregate
df.groupBy("category").agg(
    count("*").alias("count"),
    sum("amount").alias("total"),
    avg("amount").alias("average")
)

# Join tables
df1.join(df2, "key_column")
df1.join(df2, df1.key == df2.key, "left")

# Sort
df.orderBy("col_name")
df.orderBy(col("col_name").desc())

# Remove duplicates
df.dropDuplicates()
df.dropDuplicates(["col1", "col2"])
```

### Common Functions
```python
# String functions
upper("col_name")
lower("col_name")
trim("col_name")
concat("col1", "col2")
substring("col_name", 1, 5)

# Date functions
current_date()
current_timestamp()
date_format("date_col", "yyyy-MM-dd")
year("date_col")
month("date_col")
datediff("date1", "date2")  # arguments are (end, start): days from date2 to date1

# Math functions
round("col_name", 2)
abs("col_name")
ceil("col_name")
floor("col_name")

# Conditional logic
when(col("amount") > 100, "High").otherwise("Low")

# Null handling
col("col_name").isNull()
col("col_name").isNotNull()
coalesce("col1", "col2", lit(0))
```

## Common SQL Operations

### DDL Commands
```sql
-- Create database
CREATE DATABASE IF NOT EXISTS database_name;

-- Drop database
DROP DATABASE IF EXISTS database_name CASCADE;

-- Create table
CREATE TABLE table_name (
    id STRING,
    amount DOUBLE,
    date DATE
);

-- Drop table
DROP TABLE IF EXISTS table_name;

-- Describe table
DESCRIBE EXTENDED table_name;
SHOW COLUMNS FROM table_name;
```

### DML Commands
```sql
-- Insert data (values follow the table_name columns: id, amount, date)
INSERT INTO table_name VALUES ('1', 100.0, DATE '2025-01-01');

-- Update data (Delta Lake)
UPDATE table_name SET amount = 200 WHERE id = '1';

-- Delete data (Delta Lake)
DELETE FROM table_name WHERE id = '1';

-- Merge (Upsert)
MERGE INTO target_table
USING source_table
ON target_table.id = source_table.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

### Query Commands
```sql
-- Basic SELECT
SELECT * FROM table_name LIMIT 10;

-- With WHERE clause
SELECT * FROM table_name WHERE amount > 100;

-- Aggregations
SELECT category, COUNT(*), SUM(amount), AVG(amount)
FROM table_name
GROUP BY category;

-- Joins
SELECT a.*, b.name
FROM table_a a
INNER JOIN table_b b ON a.id = b.id;

-- Window functions
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) as row_num
FROM table_name;

-- CTE (Common Table Expression)
WITH summary AS (
    SELECT category, SUM(amount) as total
    FROM table_name
    GROUP BY category
)
SELECT * FROM summary WHERE total > 1000;
```

## Databricks Utilities
```python
# File system operations
dbutils.fs.ls("path/")
dbutils.fs.cp("source", "destination")
dbutils.fs.rm("path/", recurse=True)
dbutils.fs.mkdirs("path/")

# Widgets (parameters)
dbutils.widgets.text("param_name", "default_value")
param_value = dbutils.widgets.get("param_name")

# Notebooks (the second argument is the timeout in seconds)
dbutils.notebook.run("notebook_path", 600, {"param": "value"})
```

---

*Keep this reference handy as you build your data pipelines!*
