# 📖 SAS to Databricks Quick Reference for Actuaries

This section shows you **side-by-side comparisons** of common SAS code and Databricks equivalents.

---

## 1️⃣ Basic Data Aggregation

### SAS: PROC MEANS
```sas
PROC MEANS DATA=claims NOPRINT;
    CLASS specialty;
    VAR total_charge;
    OUTPUT OUT=summary 
        N=claim_count 
        SUM=total_incurred 
        MEAN=avg_claim;
RUN;
```

### Databricks: GROUP BY
```sql
SELECT 
    specialty,
    COUNT(total_charge) AS claim_count,  -- like SAS N=, counts non-missing values only
    SUM(total_charge) AS total_incurred,
    AVG(total_charge) AS avg_claim
FROM claims
GROUP BY specialty;
```

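**🎯 Key Difference**: In Databricks, the grouping, the statistics, and the output column names are all declared in one `SELECT` statement. Note that without the `NWAY` option, the SAS output data set also contains an overall (`_TYPE_ = 0`) row that the `GROUP BY` query does not produce.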

---

## 2️⃣ Frequency Tables

### SAS: PROC FREQ
```sas
PROC FREQ DATA=claims;
    TABLES specialty * state / NOCOL NOROW;
RUN;
```

### Databricks: GROUP BY with COUNT
```sql
SELECT 
    specialty,
    state,
    COUNT(*) AS frequency
FROM claims
GROUP BY specialty, state
ORDER BY frequency DESC;
```

---

## 3️⃣ Conditional Logic

### SAS: DATA Step with IF-THEN
```sas
DATA claims_categorized;
    SET claims;
    IF total_charge < 1000 THEN risk_category = 'Low';
    ELSE IF total_charge < 5000 THEN risk_category = 'Medium';
    ELSE risk_category = 'High';
RUN;
```

### Databricks: CASE WHEN
```sql
SELECT 
    *,
    CASE 
        WHEN total_charge < 1000 THEN 'Low'
        WHEN total_charge < 5000 THEN 'Medium'
        ELSE 'High'
    END AS risk_category
FROM claims;
```
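
One difference is worth checking when porting this logic. SAS treats a missing numeric value as smaller than any number, so a missing `total_charge` lands in 'Low'. In SQL, a comparison with NULL is never true, so the same row falls through to `ELSE 'High'`. The sketch below handles NULL explicitly. It uses Python's built-in `sqlite3` purely as a local stand-in for a SQL engine (an assumption for illustration, not Databricks itself), and the three-row `claims` table and the 'Unknown' bucket are made up.

```python
import sqlite3

# Local stand-in for a SQL engine; the table and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id TEXT, total_charge REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [("A", 500.0), ("B", 7500.0), ("C", None)],  # claim C has a missing charge
)

# Without the IS NULL branch, claim C would fall through to ELSE 'High'.
rows = conn.execute(
    """
    SELECT claim_id,
           CASE
               WHEN total_charge IS NULL THEN 'Unknown'
               WHEN total_charge < 1000 THEN 'Low'
               WHEN total_charge < 5000 THEN 'Medium'
               ELSE 'High'
           END AS risk_category
    FROM claims
    ORDER BY claim_id
    """
).fetchall()

risk_by_claim = dict(rows)
print(risk_by_claim)
```

Whether missing charges belong in their own bucket or in 'Low' is a business decision; the point is only to make the choice explicit.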

---

## 4️⃣ Joining Tables

### SAS: PROC SQL Join
```sas
PROC SQL;
    CREATE TABLE enriched_claims AS
    SELECT c.*, p.specialty, p.provider_name
    FROM claims c
    LEFT JOIN providers p 
        ON c.provider_id = p.provider_id;
QUIT;
```

### Databricks: SQL Join (Identical!)
```sql
SELECT c.*, p.specialty, p.provider_name
FROM claims c
LEFT JOIN providers p 
    ON c.provider_id = p.provider_id;
```

**🎯 Great News**: If you know SAS PROC SQL, most of your query-writing skills carry over directly to Databricks SQL!

---

## 5️⃣ Lagging and Leading Values (Trending)

### SAS: LAG Function
```sas
DATA trends;
    SET monthly_data;
    prior_month = LAG(total_incurred);
    growth_pct = (total_incurred - prior_month) / prior_month * 100;
RUN;
```

### Databricks: LAG Window Function
```sql
SELECT 
    *,
    LAG(total_incurred, 1) OVER (ORDER BY month) AS prior_month,
    ROUND((total_incurred - LAG(total_incurred, 1) OVER (ORDER BY month)) / 
          LAG(total_incurred, 1) OVER (ORDER BY month) * 100, 2) AS growth_pct
FROM monthly_data;
```
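
Two details tend to matter once this runs on real data: the lag should not cross from one group into the next, and a prior month of zero makes the growth ratio undefined. The sketch below adds `PARTITION BY` and `NULLIF` for those cases and computes the lag once in a CTE. It runs on Python's built-in `sqlite3` as a local stand-in (an assumption; Databricks is not involved), and the `specialty` column on `monthly_data` and the sample rows are hypothetical.

```python
import sqlite3

# Local stand-in for a SQL engine; sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_data (specialty TEXT, month TEXT, total_incurred REAL)"
)
conn.executemany(
    "INSERT INTO monthly_data VALUES (?, ?, ?)",
    [
        ("Cardiology", "2025-01", 100.0),
        ("Cardiology", "2025-02", 110.0),
        ("Oncology", "2025-01", 0.0),
        ("Oncology", "2025-02", 50.0),
    ],
)

trend_rows = conn.execute(
    """
    WITH lagged AS (
        SELECT specialty, month, total_incurred,
               -- PARTITION BY restarts the lag for each specialty
               LAG(total_incurred, 1) OVER (
                   PARTITION BY specialty ORDER BY month
               ) AS prior_month
        FROM monthly_data
    )
    SELECT specialty, month, prior_month,
           -- NULLIF turns a zero denominator into NULL instead of dividing by zero
           ROUND((total_incurred - prior_month) / NULLIF(prior_month, 0) * 100, 2)
               AS growth_pct
    FROM lagged
    ORDER BY specialty, month
    """
).fetchall()

for row in trend_rows:
    print(row)
```

The first month of each specialty has no prior value, so both `prior_month` and `growth_pct` come back as NULL (`None` in Python).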

---

## 6️⃣ Percentiles and Quantiles

### SAS: PROC UNIVARIATE
```sas
PROC UNIVARIATE DATA=claims;
    VAR total_charge;
    OUTPUT OUT=percentiles 
        PCTLPTS=25 50 75 90 95 99
        PCTLPRE=P;
RUN;
```

### Databricks: PERCENTILE_CONT
```sql
SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_charge) AS P25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_charge) AS P50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_charge) AS P75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY total_charge) AS P90,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_charge) AS P95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_charge) AS P99
FROM claims;
```
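
When reconciling results against SAS, it helps to know what `PERCENTILE_CONT` computes: it interpolates linearly between the two nearest ordered values. The default percentile definition in PROC UNIVARIATE (`PCTLDEF=5`) follows a different rule, so small samples may not match exactly. Below is a plain-Python sketch of the linear-interpolation definition; the helper name `percentile_cont` and the sample charges are illustrative assumptions, not a Databricks API.

```python
def percentile_cont(values, fraction):
    """Linear-interpolation percentile, following the usual PERCENTILE_CONT definition."""
    ordered = sorted(v for v in values if v is not None)  # NULLs are ignored
    if not ordered:
        return None
    position = fraction * (len(ordered) - 1)  # zero-based fractional row position
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower  # how far we are between the two neighbours
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


charges = [100.0, 200.0, 300.0, 400.0]  # made-up sample
p25 = percentile_cont(charges, 0.25)
p50 = percentile_cont(charges, 0.50)
print(p25, p50)
```

On large claim files the two definitions usually agree closely; the gap shows up mainly on small groups.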

---

## 7️⃣ Moving Averages (Smoothing)

### SAS: Rolling Average
```sas
DATA moving_avg;
    SET monthly_data;
    avg_3mo = MEAN(total_incurred, LAG(total_incurred), LAG2(total_incurred));
RUN;
```

### Databricks: Window Function with ROWS
```sql
SELECT 
    *,
    AVG(total_incurred) OVER (
        ORDER BY month 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS avg_3mo
FROM monthly_data;
```
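
In both versions the first two rows average fewer than three months: the SAS `MEAN` function skips the missing lags, and the SQL window simply holds fewer rows. If partial windows should be excluded, a count over the same window makes them easy to spot. The sketch below shows this on Python's built-in `sqlite3` as a local stand-in (an assumption, not Databricks), with made-up data and a hypothetical `months_in_window` column.

```python
import sqlite3

# Local stand-in for a SQL engine; sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_data (month TEXT, total_incurred REAL)")
conn.executemany(
    "INSERT INTO monthly_data VALUES (?, ?)",
    [("2025-01", 90.0), ("2025-02", 120.0), ("2025-03", 150.0), ("2025-04", 180.0)],
)

moving_avg_rows = conn.execute(
    """
    SELECT month,
           AVG(total_incurred) OVER (
               ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS avg_3mo,
           COUNT(total_incurred) OVER (
               ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS months_in_window
    FROM monthly_data
    ORDER BY month
    """
).fetchall()

# Keep only rows whose window really covers three months.
full_windows = [(month, avg) for month, avg, n in moving_avg_rows if n == 3]
print(moving_avg_rows)
print(full_windows)
```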

---

## 🎓 Quick Translation Guide

| **SAS** | **Databricks** | **Notes** |
|---------|----------------|-----------|
| `PROC SQL` | SQL queries | Almost identical! |
| `PROC MEANS` | `GROUP BY` + aggregations | Very similar |
| `PROC FREQ` | `GROUP BY` + `COUNT()` | Same logic |
| `DATA` step | `SELECT` with transformations | Different syntax, same result |
| `LAG()` | `LAG() OVER (ORDER BY)` | Window function needed |
| `RETAIN` | Window functions | Use cumulative sums |
| `MERGE` | `JOIN` | SQL joins |
| `WHERE` | `WHERE` | Identical! |
| `IF-THEN-ELSE` | `CASE WHEN` | Different syntax |
| Macros | Widgets + parameters | Similar concept |
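
As an illustration of the `RETAIN` row, a running total that a DATA step would accumulate in a retained variable can be written as a cumulative `SUM() OVER`. This is a minimal sketch on Python's built-in `sqlite3`, used only as a local stand-in for a SQL engine (an assumption); the `ytd_incurred` column name and the data are made up.

```python
import sqlite3

# Local stand-in for a SQL engine; sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_data (month TEXT, total_incurred REAL)")
conn.executemany(
    "INSERT INTO monthly_data VALUES (?, ?)",
    [("2025-01", 100.0), ("2025-02", 150.0), ("2025-03", 50.0)],
)

running_rows = conn.execute(
    """
    SELECT month,
           total_incurred,
           -- Sum everything from the first row up to the current one
           SUM(total_incurred) OVER (
               ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS ytd_incurred
    FROM monthly_data
    ORDER BY month
    """
).fetchall()

for row in running_rows:
    print(row)
```

Adding `PARTITION BY` to the window would restart the total per group, which is roughly what `BY`-group processing with `FIRST.` logic does in SAS.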

---

## 💡 Pro Tips for SAS Users

1. **Most PROC SQL knowledge transfers**: If you're comfortable with SAS PROC SQL, you'll pick up Databricks SQL quickly!

2. **Window functions = LAG/LEAD on steroids**: More powerful than SAS LAG functions.

3. **No DATA step needed**: Most transformations can be done in SQL with `CASE WHEN`.

4. **CTEs are your friend**: Use `WITH` clauses instead of creating intermediate datasets.

5. **Display > PROC PRINT**: Just use `display()` in Python cells or `SELECT` in SQL.
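
Tips 3 and 4 can be sketched together: two chained CTEs take the place of an intermediate categorized data set and a separate summary step. The example runs on Python's built-in `sqlite3` as a local stand-in for a SQL engine (an assumption, not Databricks), and the CTE names `categorized` and `summary` and the sample claims are illustrative.

```python
import sqlite3

# Local stand-in for a SQL engine; sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (specialty TEXT, total_charge REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [
        ("Cardiology", 500.0),
        ("Cardiology", 6000.0),
        ("Oncology", 2500.0),
        ("Oncology", 800.0),
    ],
)

summary_rows = conn.execute(
    """
    WITH categorized AS (      -- replaces an intermediate DATA step output
        SELECT specialty, total_charge,
               CASE
                   WHEN total_charge < 1000 THEN 'Low'
                   WHEN total_charge < 5000 THEN 'Medium'
                   ELSE 'High'
               END AS risk_category
        FROM claims
    ),
    summary AS (               -- replaces a follow-up PROC MEANS
        SELECT risk_category,
               COUNT(*) AS claim_count,
               SUM(total_charge) AS total_incurred
        FROM categorized
        GROUP BY risk_category
    )
    SELECT * FROM summary ORDER BY risk_category
    """
).fetchall()

for row in summary_rows:
    print(row)
```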

---


# 📚 Best Practices & Performance Tips

## 🚀 Performance Optimization

### 1. **Use Partitioning for Large Tables**
```python
# Partition by date for time-series data
df.write \
    .format("delta") \
    .partitionBy("claim_date") \
    .saveAsTable("payer_gold.claims_partitioned")
```

### 2. **Enable Z-Ordering for Common Filters**
```sql
OPTIMIZE payer_gold.claims_enriched
ZORDER BY (member_id, claim_date);
```

### 3. **Use Caching for Frequently Accessed DataFrames**
```python
claims_df = spark.table("payer_silver.claims").cache()
# cache() is lazy: the first action materializes the data, later actions reuse it
```

### 4. **Broadcast Small Tables in Joins**
```python
from pyspark.sql.functions import broadcast

large_df.join(broadcast(small_df), "key")
```

---

## 🔒 Data Quality Best Practices

### 1. **Always Validate Data**
```python
# Add constraints
spark.sql("""
    ALTER TABLE payer_silver.claims 
    ADD CONSTRAINT valid_charge CHECK (total_charge > 0)
""")
```

### 2. **Use Schema Evolution Carefully**
```python
# Explicitly define schema for production
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("claim_id", StringType(), False),
    StructField("total_charge", DoubleType(), True),
    # ... more fields
])
```

### 3. **Implement Data Quality Checks**
```python
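from pyspark.sql.functions import col, current_date

def validate_claims(df):
    """Run data quality checks and return the number of failing rows per check."""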
    checks = {
        "null_claim_ids": df.filter(col("claim_id").isNull()).count(),
        "negative_charges": df.filter(col("total_charge") < 0).count(),
        "future_dates": df.filter(col("claim_date") > current_date()).count()
    }
    return checks
```

---

## 💾 Delta Lake Best Practices

### 1. **Regular Maintenance**
```sql
-- Compact small files
OPTIMIZE payer_gold.claims_enriched;

-- Remove old versions (keep 7 days)
VACUUM payer_gold.claims_enriched RETAIN 168 HOURS;

-- Update statistics
ANALYZE TABLE payer_gold.claims_enriched COMPUTE STATISTICS;
```

### 2. **Use Time Travel for Auditing**
```sql
-- Query previous version
SELECT * FROM payer_gold.claims_enriched VERSION AS OF 1;

-- Query as of timestamp
SELECT * FROM payer_gold.claims_enriched TIMESTAMP AS OF '2025-01-01';
```

### 3. **Enable Change Data Feed**
```sql
ALTER TABLE payer_gold.claims_enriched 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

---

## 🏗️ Architecture Best Practices

### 1. **Medallion Layer Guidelines**
- **Bronze**: Keep all source data, minimal transformation
- **Silver**: One source system = one silver table (usually)
- **Gold**: Many silver tables → one gold table (join/aggregate)

### 2. **Naming Conventions**
```
Catalog: <organization>_<environment>
Schema: <domain>_<layer>
Table: <entity>_<descriptor>

Examples:
- acme_prod.payer_bronze.claims_raw
- acme_dev.payer_silver.claims_cleaned
- acme_prod.payer_gold.member_360_view
```
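
A convention like this is easier to keep when names are assembled by a small helper rather than typed by hand. The function below is a hypothetical sketch, not part of any Databricks API: `build_table_name`, the allowed-layer set, and the lowercase-with-underscores rule are all assumptions made for illustration.

```python
import re

# Assumed rule for this sketch: lowercase letters, digits, and underscores only.
VALID_LAYERS = {"bronze", "silver", "gold"}
_NAME_PART = re.compile(r"^[a-z][a-z0-9_]*$")


def build_table_name(organization, environment, domain, layer, table):
    """Return '<organization>_<environment>.<domain>_<layer>.<table>'."""
    if layer not in VALID_LAYERS:
        raise ValueError(f"Unknown layer: {layer!r}")
    for part in (organization, environment, domain, table):
        if not _NAME_PART.match(part):
            raise ValueError(f"Invalid name part: {part!r}")
    return f"{organization}_{environment}.{domain}_{layer}.{table}"


full_name = build_table_name("acme", "prod", "payer", "gold", "member_360_view")
print(full_name)
```

The resulting string can be passed to calls such as `spark.table(...)` or `saveAsTable(...)` so that every notebook resolves names the same way.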

### 3. **Documentation**
```sql
-- Add table comments
COMMENT ON TABLE payer_gold.claims_enriched IS 
'Enriched claims with member and provider details for analytics';

-- Add column comments
ALTER TABLE payer_gold.claims_enriched 
ALTER COLUMN total_charge COMMENT 'Total charged amount in USD';
```

---


# 📖 Quick Reference Guide - SQL & PySpark

## Common PySpark Operations

### Reading Data
```python
# From Delta table
df = spark.table("catalog.schema.table")

# From CSV
df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")

# From JSON
df = spark.read.json("path/to/file.json")

# From Parquet
df = spark.read.parquet("path/to/file.parquet")
```

### Writing Data
```python
# Write to Delta table
df.write.format("delta").mode("overwrite").saveAsTable("table_name")

# Append mode
df.write.format("delta").mode("append").saveAsTable("table_name")

# With partitioning
df.write.format("delta").partitionBy("date_col").saveAsTable("table_name")
```

### Common Transformations
```python
# Note: the wildcard import shadows Python built-ins such as sum, round, and abs
from pyspark.sql.functions import *

# Select columns
df.select("col1", "col2")

# Filter rows
df.filter(col("amount") > 100)
df.where("amount > 100")

# Add new column
df.withColumn("new_col", col("old_col") * 2)

# Rename column
df.withColumnRenamed("old_name", "new_name")

# Drop column
df.drop("col_name")

# Group by and aggregate
df.groupBy("category").agg(
    count("*").alias("count"),
    sum("amount").alias("total"),
    avg("amount").alias("average")
)

# Join tables
df1.join(df2, "key_column")
df1.join(df2, df1.key == df2.key, "left")

# Sort
df.orderBy("col_name")
df.orderBy(col("col_name").desc())

# Remove duplicates
df.dropDuplicates()
df.dropDuplicates(["col1", "col2"])
```

### Common Functions
```python
# String functions
upper("col_name")
lower("col_name")
trim("col_name")
concat("col1", "col2")
substring("col_name", 1, 5)

# Date functions
current_date()
current_timestamp()
date_format("date_col", "yyyy-MM-dd")
year("date_col")
month("date_col")
datediff("date1", "date2")

# Math functions
round("col_name", 2)
abs("col_name")
ceil("col_name")
floor("col_name")

# Conditional logic
when(col("amount") > 100, "High").otherwise("Low")

# Null handling
col("col_name").isNull()
col("col_name").isNotNull()
coalesce("col1", "col2", lit(0))
```

## Common SQL Operations

### DDL Commands
```sql
-- Create database
CREATE DATABASE IF NOT EXISTS database_name;

-- Drop database
DROP DATABASE IF EXISTS database_name CASCADE;

-- Create table
CREATE TABLE table_name (
    id STRING,
    amount DOUBLE,
    date DATE
);

-- Drop table
DROP TABLE IF EXISTS table_name;

-- Describe table
DESCRIBE EXTENDED table_name;
SHOW COLUMNS FROM table_name;
```

### DML Commands
```sql
-- Insert data (values follow the column order of the table created above: id, amount, date)
INSERT INTO table_name VALUES ('1', 100.0, DATE'2025-01-01');

-- Update data (Delta Lake)
UPDATE table_name SET amount = 200 WHERE id = '1';

-- Delete data (Delta Lake)
DELETE FROM table_name WHERE id = '1';

-- Merge (Upsert)
MERGE INTO target_table
USING source_table
ON target_table.id = source_table.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

### Query Commands
```sql
-- Basic SELECT
SELECT * FROM table_name LIMIT 10;

-- With WHERE clause
SELECT * FROM table_name WHERE amount > 100;

-- Aggregations
SELECT category, COUNT(*), SUM(amount), AVG(amount)
FROM table_name
GROUP BY category;

-- Joins
SELECT a.*, b.name
FROM table_a a
INNER JOIN table_b b ON a.id = b.id;

-- Window functions
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) AS row_num
FROM table_name;

-- CTE (Common Table Expression)
WITH summary AS (
    SELECT category, SUM(amount) as total
    FROM table_name
    GROUP BY category
)
SELECT * FROM summary WHERE total > 1000;
```

## Databricks Utilities
```python
# File system operations
dbutils.fs.ls("path/")
dbutils.fs.cp("source", "destination")
dbutils.fs.rm("path/", recurse=True)
dbutils.fs.mkdirs("path/")

# Widgets (parameters)
dbutils.widgets.text("param_name", "default_value")
param_value = dbutils.widgets.get("param_name")

# Notebooks
dbutils.notebook.run("notebook_path", 60, {"param": "value"})  # 60 = timeout in seconds
```

---


# 📚 Lazy Evaluation & Deterministic Execution - Quick Reference

---

### ✅ Best Practices Checklist for HQRI Gold Layer Pipelines

| Scenario | Recommended Approach | Why? |
|----------|---------------------|------|
| **Interactive analysis** on same dataset | `.cache()` or `.persist()` | Fast in-memory reuse |
| **Production ETL** between pipeline stages | `.write.saveAsTable()` | Durable, auditable, time travel |
| **Multiple branches** from same transformation | `.cache()` + `.count()` | Forces evaluation once, reuse many times |
| **Very wide transformations** (many joins) | Write to Delta checkpoint | Breaks lineage, fault-tolerant |
| **Data quality validation** | `.count()` after each stage | Forces execution, verifies row counts |
| **Reusable business logic** | CREATE VIEW | Dynamic, always reflects latest data |
| **Materialized analytics tables** | CREATE TABLE AS SELECT | Static, fast queries, predictable |

---

### 🚨 Common Pitfalls to Avoid

#### ❌ Pitfall 1: Recomputing expensive transformations
```python
from pyspark.sql.functions import col

# BAD: Recomputes the join for each of the three actions
df = members.join(claims, "member_id")
count1 = df.count()
count2 = df.filter(col("age") > 65).count()
count3 = df.groupBy("plan_id").count().count()

# GOOD: Cache and reuse
df = members.join(claims, "member_id").cache()
df.count()  # Trigger caching
count1 = df.count()
count2 = df.filter(col("age") > 65).count()
count3 = df.groupBy("plan_id").count().count()
df.unpersist()
```

#### ❌ Pitfall 2: Not forcing evaluation at checkpoints
```python
# BAD: Error only surfaces at the end
df1 = spark.table("table1").filter(col("bad_column") == 1)  # Typo in column name
df2 = df1.join(other_table, "id")
df3 = df2.groupBy("plan_id").count()
df3.write.saveAsTable("output")  # ERROR HERE - hard to debug!

# GOOD: Validate early
df1 = spark.table("table1").filter(col("bad_column") == 1)
print(f"Stage 1: {df1.count()} rows")  # ERROR HERE - easy to debug!
df2 = df1.join(other_table, "id")
print(f"Stage 2: {df2.count()} rows")
df3 = df2.groupBy("plan_id").count()
df3.write.saveAsTable("output")
```

#### ❌ Pitfall 3: Caching too early or too much
```python
# BAD: Caching before filtering (wastes memory)
df = spark.table("claims").cache()
df.count()
filtered = df.filter(col("total_charge") > 1000)

# GOOD: Filter first, then cache
df = spark.table("claims").filter(col("total_charge") > 1000).cache()
df.count()
```

---

### 🎯 Decision Tree: When to Use Each Technique

```
Start: Do you need the result more than once?
│
├─ NO → Don't cache, let Spark optimize
│
└─ YES → Is this a production pipeline?
    │
    ├─ YES → Write to Delta table (.saveAsTable)
    │        • Durable across cluster restarts
    │        • Enables time travel
    │        • Can be read by other systems
    │
    └─ NO → Is it used in the same session?
        │
        ├─ YES → Use .cache() or .persist()
        │        • Fast in-memory access
        │        • Remember to .unpersist() when done
        │
        └─ NO → Create a VIEW
                 • Reusable across sessions
                 • Dynamic (always fresh data)
```

---

### 📖 Summary: Lazy Evaluation Principles

1. **Transformations are lazy** - build execution plan without running
2. **Actions are eager** - trigger execution of entire plan
3. **Cache for reuse** - avoid recomputing expensive transformations
4. **Write checkpoints** - break lineage and ensure durability
5. **Force evaluation** - use `.count()` to validate at each stage
6. **Clean up** - `.unpersist()` to free memory
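
For readers new to the idea, the principles above can be illustrated without a cluster. The class below is a toy, plain-Python analogy and not Spark: `LazyPipeline`, `load_claims`, and the sample charges are all made up, and real Spark caching is considerably more involved. It only shows that transformations record a plan, actions run it, and caching avoids re-running it.

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, load, steps=()):
        self._load = load            # callable that returns the source rows
        self._steps = tuple(steps)   # recorded filters, not yet applied
        self._cache_enabled = False
        self._cached_rows = None

    def filter(self, predicate):
        # Transformation: returns a new plan and runs nothing.
        return LazyPipeline(self._load, self._steps + (predicate,))

    def cache(self):
        # Marks the result for reuse; nothing is stored until the next action.
        self._cache_enabled = True
        return self

    def unpersist(self):
        self._cache_enabled = False
        self._cached_rows = None
        return self

    def count(self):
        # Action: executes the whole plan unless a cached result exists.
        if self._cached_rows is not None:
            return len(self._cached_rows)
        rows = list(self._load())
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        if self._cache_enabled:
            self._cached_rows = rows
        return len(rows)


load_calls = []  # records every time the source is actually read


def load_claims():
    load_calls.append(1)
    return [500.0, 1500.0, 7500.0]  # made-up charges


large_claims = LazyPipeline(load_claims).filter(lambda charge: charge > 1000)
calls_after_transformation = len(load_calls)  # nothing has been read yet

large_claims.count()
large_claims.count()
calls_without_cache = len(load_calls)  # each action re-read the source

large_claims.cache()
large_claims.count()  # first action after cache() materializes the rows
large_claims.count()  # served from the cached rows
calls_with_cache = len(load_calls)

large_claims.unpersist()
print(calls_after_transformation, calls_without_cache, calls_with_cache)
```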

---


*Keep this reference handy as you build your data pipelines!*
