# Modern Approaches to Struct Aggregation in Databricks

## Working with Nested Healthcare Data: From Deprecated to Modern SQL

This notebook demonstrates the evolution from deprecated `LATERAL VIEW` syntax to modern table-valued function approaches in Databricks Runtime 12.2+. We'll explore different methods for aggregating nested struct data using a healthcare claims example.

### 📋 What You'll Learn:
- ❌ Why `LATERAL VIEW` is deprecated and what to use instead
- ✅ Modern table-valued function syntax for exploding arrays
- 🔄 Traditional aggregation vs. higher-order functions
- 🏗️ How to rebuild structs with calculated values

### 🏥 Use Case: Healthcare Claims Processing
We'll work with a realistic nested data structure representing medical claims with header information and detailed line items.


In [1]:
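from pyspark.sql import SparkSession
# Import only the types used below. Wildcard imports from pyspark.sql.functions
# shadow Python built-ins such as sum() and max().
from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType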

# Initialize Spark session
spark = SparkSession.builder.appName("BlogValidationTest").getOrCreate()


## 1. Sample Data Setup

Let's create a realistic healthcare claims dataset with nested structures. Each claim has:
- **Header information**: claim ID, line of business, total charges
- **Detail array**: individual charge line items with amounts and units


In [2]:
# Build one sample claim: a header struct plus an array of line-item structs
medicaid_data = [
    {
        "claimHeader": {
            "claimId": "ABC123456789",
            "lineOfBusiness": "Medicaid",
            "totalCharges": 3.25
        },
        "claimDetail": [
            {"chargeAmount": 1.25, "units": 1.00},
            {"chargeAmount": 2.00, "units": 1.00}
        ]
    }
]

# Define the schema
schema = StructType([
    StructField("claimHeader", StructType([
        StructField("claimId", StringType(), True),
        StructField("lineOfBusiness", StringType(), True),
        StructField("totalCharges", DoubleType(), True)
    ]), True),
    StructField("claimDetail", ArrayType(StructType([
        StructField("chargeAmount", DoubleType(), True),
        StructField("units", DoubleType(), True)
    ])), True)
])

# Create DataFrame
myClaimsTable = spark.createDataFrame(medicaid_data, schema)
myClaimsTable.createOrReplaceTempView("myClaimsTable")

print("Created test data:")
myClaimsTable.show(truncate=False)
myClaimsTable.printSchema()


Created test data:
+------------------------------+-------------------------+
|claimHeader                   |claimDetail              |
+------------------------------+-------------------------+
|{ABC123456789, Medicaid, 3.25}|[{1.25, 1.0}, {2.0, 1.0}]|
+------------------------------+-------------------------+

root
 |-- claimHeader: struct (nullable = true)
 |    |-- claimId: string (nullable = true)
 |    |-- lineOfBusiness: string (nullable = true)
 |    |-- totalCharges: double (nullable = true)
 |-- claimDetail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- chargeAmount: double (nullable = true)
 |    |    |-- units: double (nullable = true)



## 2. The Deprecated Approach: LATERAL VIEW

⚠️ **Important**: `LATERAL VIEW` is deprecated in Databricks Runtime 12.2+ but still works. This section shows the old syntax for comparison.


In [3]:

%sql
-- DEPRECATED SYNTAX (still works but not recommended)
SELECT 
    claimHeader.claimId,
    claimHeader.lineOfBusiness,
    d.chargeAmount,
    d.units
FROM myClaimsTable
LATERAL VIEW explode(claimDetail) AS d


Unnamed: 0,claimId,lineOfBusiness,chargeAmount,units
0,ABC123456789,Medicaid,1.25,1.0
1,ABC123456789,Medicaid,2.0,1.0


## 3. Modern Approach: Table-Valued Functions ✅

**Recommended for Databricks Runtime 12.2+**

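This is the modern way to explode arrays: invoke the generator as a table-valued function in the `FROM` clause. In `LATERAL VIEW explode(...) AS d`, `d` names the generated column. Here `d` is a table alias and the exploded element lives in its default column `col`, so fields are referenced as `d.col.chargeAmount`: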


In [4]:
%sql
-- MODERN SYNTAX (Recommended for Runtime 12.2+)
SELECT 
    claimHeader.claimId,
    claimHeader.lineOfBusiness,
    d.col.chargeAmount,
    d.col.units
FROM myClaimsTable,
    LATERAL explode(claimDetail) AS d


Unnamed: 0,claimId,lineOfBusiness,chargeAmount,units
0,ABC123456789,Medicaid,1.25,1.0
1,ABC123456789,Medicaid,2.0,1.0
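
To make the `d.col` naming concrete, here is a plain-Python model of what the table-valued `explode` does to the sample claim. It is only a sketch that runs without Spark: `explode_rows` is a hypothetical helper, not a Spark API, and the dictionaries stand in for rows of `myClaimsTable`.

```python
# Plain-Python model of: FROM myClaimsTable, LATERAL explode(claimDetail) AS d
# The dicts below stand in for rows of myClaimsTable; no Spark is involved.
claims = [
    {
        "claimHeader": {
            "claimId": "ABC123456789",
            "lineOfBusiness": "Medicaid",
            "totalCharges": 3.25,
        },
        "claimDetail": [
            {"chargeAmount": 1.25, "units": 1.0},
            {"chargeAmount": 2.0, "units": 1.0},
        ],
    }
]


def explode_rows(rows, array_field):
    """Hypothetical helper: yield (parent_row, d) pairs, one per array element.

    `d` mimics the table alias; the element sits under its single column `col`.
    """
    for row in rows:
        # Like explode (not explode_outer), a null or empty array yields no rows.
        for element in row.get(array_field) or []:
            yield row, {"col": element}


exploded = [
    {
        "claimId": parent["claimHeader"]["claimId"],
        "lineOfBusiness": parent["claimHeader"]["lineOfBusiness"],
        "chargeAmount": d["col"]["chargeAmount"],  # d.col.chargeAmount in SQL
        "units": d["col"]["units"],                # d.col.units in SQL
    }
    for parent, d in explode_rows(claims, "claimDetail")
]

for row in exploded:
    print(row)
```

The one input claim becomes two rows, one per line item, each carrying the parent's header fields.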


## 4. Aggregation with Modern Syntax

Now let's calculate totals by aggregating the exploded data. This is useful when you need to recalculate totals or perform analytics on individual line items:


In [5]:
%sql
SELECT
  claimHeader.claimId,
  SUM(detail.col.chargeAmount * detail.col.units) AS claimTotal
FROM myClaimsTable,
  LATERAL explode(claimDetail) AS detail
GROUP BY claimHeader.claimId


Unnamed: 0,claimId,claimTotal
0,ABC123456789,3.25
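
The explode-then-`GROUP BY` pattern can be pictured as a two-step process: flatten the line items, then keep one running sum per claim. The sketch below models only the second step in plain Python, assuming the rows have already been exploded. The `exploded` list is sample data copied from the query output above, and nothing here calls Spark.

```python
from collections import defaultdict

# Exploded line items, as produced by the FROM clause (sample data only).
exploded = [
    {"claimId": "ABC123456789", "chargeAmount": 1.25, "units": 1.0},
    {"claimId": "ABC123456789", "chargeAmount": 2.0, "units": 1.0},
]

# GROUP BY claimId with SUM(chargeAmount * units): one accumulator per claim.
claim_totals = defaultdict(float)
for item in exploded:
    claim_totals[item["claimId"]] += item["chargeAmount"] * item["units"]

print(dict(claim_totals))  # {'ABC123456789': 3.25}
```

In Spark the grouping step generally requires bringing all rows for a claim together, which is the cost the next section avoids.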


## 5. Higher-Order Functions: The Most Efficient Approach 🚀

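Higher-order functions such as `aggregate()` fold over the array inside each row, so the query needs neither an `explode` nor a `GROUP BY`. That usually means less data movement and shorter SQL. Note the explicit `CAST(0.0 AS DOUBLE)`: the literal `0.0` is a `DECIMAL` in Spark SQL, and the start value's type must match what the lambda returns: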


In [6]:
%sql
SELECT 
    claimHeader.claimId,
    claimHeader.lineOfBusiness,
    claimHeader.totalCharges as original_totalCharges,
    aggregate(claimDetail, CAST(0.0 AS DOUBLE), (acc, detail) -> acc + detail.chargeAmount * detail.units) as calculated_totalCharges
FROM myClaimsTable


Unnamed: 0,claimId,lineOfBusiness,original_totalCharges,calculated_totalCharges
0,ABC123456789,Medicaid,3.25,3.25
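
`aggregate()` is a left fold: it starts from an initial value and applies the lambda to each array element in turn. A plain-Python sketch of the same fold with `functools.reduce` is shown below. It assumes the sample claim from this notebook and uses `math.isclose` for the comparison with the header total, which is a choice made for this illustration and not something the SQL does.

```python
import math
from functools import reduce

# One row of myClaimsTable, modeled as a dict (sample data from this notebook).
claim = {
    "claimHeader": {"claimId": "ABC123456789", "totalCharges": 3.25},
    "claimDetail": [
        {"chargeAmount": 1.25, "units": 1.0},
        {"chargeAmount": 2.0, "units": 1.0},
    ],
}

# Mirrors: aggregate(claimDetail, CAST(0.0 AS DOUBLE),
#                    (acc, detail) -> acc + detail.chargeAmount * detail.units)
calculated_total = reduce(
    lambda acc, detail: acc + detail["chargeAmount"] * detail["units"],
    claim["claimDetail"],
    0.0,  # start value; a Python float plays the role of the DOUBLE cast
)

# Compare doubles with a tolerance instead of ==, since sums of real charge
# amounts are not always exactly representable in binary floating point.
matches_header = math.isclose(calculated_total, claim["claimHeader"]["totalCharges"])

print(calculated_total, matches_header)  # 3.25 True
```

The fold never leaves the row, which is why no grouping step is needed afterwards.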


## 6. Rebuilding Structs with Calculated Values

Often you need to update nested structures with calculated values. Here's how to rebuild the original struct with the corrected total:


In [7]:
%sql
SELECT 
    struct(
        claimHeader.claimId,
        claimHeader.lineOfBusiness,
        -- Alias the computed value so the field keeps its name instead of defaulting to col3
        aggregate(claimDetail, CAST(0.0 AS DOUBLE), (acc, detail) -> acc + detail.chargeAmount * detail.units) AS totalCharges
    ) as claimHeader,
    claimDetail
FROM myClaimsTable


Unnamed: 0,claimHeader,claimDetail
0,"{'claimId': 'ABC123456789', 'lineOfBusiness': 'Medicaid', 'totalCharges': 3.25}","[{'chargeAmount': 1.25, 'units': 1.0}, {'chargeAmount': 2.0, 'units': 1.0}]"
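
One detail to watch when rebuilding a struct is field naming: an expression passed to `struct()` without an alias receives a positional name such as `col3`, whereas an aliased expression keeps a meaningful name. The plain-Python sketch below models the rebuild with every field named explicitly. `rebuild_header` is a hypothetical helper, and the stale header total of `0.0` is invented here purely to show the value being corrected.

```python
from functools import reduce

# Sample claim whose header total is deliberately stale (0.0) for illustration.
claim = {
    "claimHeader": {
        "claimId": "ABC123456789",
        "lineOfBusiness": "Medicaid",
        "totalCharges": 0.0,
    },
    "claimDetail": [
        {"chargeAmount": 1.25, "units": 1.0},
        {"chargeAmount": 2.0, "units": 1.0},
    ],
}


def rebuild_header(row):
    """Hypothetical helper: return a new row whose header total is recomputed."""
    total = reduce(
        lambda acc, detail: acc + detail["chargeAmount"] * detail["units"],
        row["claimDetail"],
        0.0,
    )
    # Name every field explicitly, as an alias inside struct() would in SQL.
    header = {
        "claimId": row["claimHeader"]["claimId"],
        "lineOfBusiness": row["claimHeader"]["lineOfBusiness"],
        "totalCharges": total,
    }
    # The detail array is passed through unchanged, as in the SELECT list.
    return {"claimHeader": header, "claimDetail": row["claimDetail"]}


rebuilt = rebuild_header(claim)
print(rebuilt["claimHeader"])
```

The helper builds a new row instead of mutating the input, which matches how a `SELECT` produces a new result set and leaves the source table untouched.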


## 📊 Performance & Approach Comparison

### **Approach Recommendations:**

| Method | Use Case | Performance | Complexity |
|--------|----------|-------------|------------|
| **Higher-Order Functions** | Simple aggregations, large datasets | ⭐⭐⭐ Best | ⭐⭐ Medium |
| **Modern LATERAL** | Complex queries, familiar SQL patterns | ⭐⭐ Good | ⭐ Easy |
| **LATERAL VIEW** | Legacy codebases only | ⭐⭐ Good | ⭐ Easy |

### **Key Syntax Evolution:**

```sql
-- ❌ DEPRECATED (Databricks Runtime 12.2+)
FROM myClaimsTable
LATERAL VIEW explode(claimDetail) AS d

-- ✅ MODERN (Recommended)
FROM myClaimsTable,
    LATERAL explode(claimDetail) AS d

-- 🚀 BEST (Higher-order functions)
aggregate(claimDetail, CAST(0.0 AS DOUBLE), (acc, detail) -> acc + detail.chargeAmount * detail.units)
```

## 🎯 Best Practices & Recommendations

1. **For New Development**: Use higher-order functions (`aggregate`, `transform`, `filter`)
2. **For Migration**: Replace `LATERAL VIEW` with modern table-valued function syntax
3. **For Performance**: Higher-order functions avoid row explosion and the regrouping that follows it, so they are usually more efficient; confirm with your own data and query plans
4. **For Readability**: Choose the approach your team is most comfortable with

## 📚 References

- [Databricks LATERAL VIEW Documentation](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-qry-select-lateral-view)
- [Modern Table Reference Syntax](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-qry-select-table-reference)
- [Higher-Order Functions in Spark SQL](https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#higher-order-functions)

---

**💡 Pro Tip**: Start with higher-order functions for new projects. They keep per-row array logic in a single expression and usually perform better than explode-and-regroup queries.
