# Spark 4.0 Variant Data Type Analysis - Complete Use Cases

## Overview
This notebook demonstrates the power of Apache Spark 4.0's new **Variant data type** for processing heterogeneous, semi-structured data across three real-world use cases:

### 🎯 **Use Cases Covered**
1. **🛒 E-commerce Event Analytics** - Multi-event type analysis (purchases, searches, wishlists)
2. **🏭 IoT Sensor Processing** - Oil rig sensor data with diverse sensor types and readings
3. **🛡️ Security Log Analysis** - Multi-system security event correlation (Firewall, Antivirus, IDS)

### 🔧 **Key Features**
- **Unified Data Processing**: Handle diverse JSON structures with a single data type
- **CTE-Based Queries**: SQL analyses are structured as common table expressions and run as distributed Spark jobs
- **Real-World Patterns**: Analysis techniques modeled on common industry practice
- **Mixed API Approach**: DataFrame API for loading and parsing, SQL for analysis

### 📊 **Technical Highlights**
- **Variant Data Type**: Stores semi-structured JSON natively, with no schema declared up front
- **VARIANT_GET()**: Extracts the value at a JSON path and casts it to a requested type
- **parse_json()**: Converts JSON strings into Variant values (used here through the DataFrame API)
- **Cross-System Correlation**: Multi-source data analysis
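
As a minimal sketch of the highlights above, the snippet below parses JSON strings into a Variant column and extracts typed fields from it. It assumes an active `SparkSession` on Spark 4.0+; the sample rows, the column names `event_json` and `event_data`, and the view name `events_demo` are illustrative and are not taken from this notebook's modules.

```python
import json

# Hypothetical sample rows: three event shapes sharing one JSON string column.
SAMPLE_EVENTS = [
    (1, json.dumps({"event_type": "purchase", "amount": 49.99, "category": "books"})),
    (2, json.dumps({"event_type": "search", "query": "usb-c cable", "results": 12})),
    (3, json.dumps({"event_type": "wishlist", "product_id": "P-100"})),
]


def variant_demo(spark):
    """Parse JSON strings into a Variant column and extract typed fields."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import col, parse_json

    raw = spark.createDataFrame(SAMPLE_EVENTS, ["event_id", "event_json"])
    events = raw.select(
        "event_id",
        parse_json(col("event_json")).alias("event_data"),  # string -> Variant
    )
    events.createOrReplaceTempView("events_demo")

    # Both functions return NULL for a missing path; try_variant_get also
    # returns NULL (instead of raising) when the cast to the target type fails.
    return spark.sql("""
        SELECT event_id,
               variant_get(event_data, '$.event_type', 'string') AS event_type,
               try_variant_get(event_data, '$.amount', 'double') AS amount
        FROM events_demo
        ORDER BY event_id
    """)
```

In a notebook cell, `variant_demo(spark).show()` would display one row per event, with `amount` populated only where the JSON carries that field.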

---

**Authors**: Jules S. Damji & Cursor AI  
**Requirements**: Apache Spark 4.0+ with Variant support  
**Dataset Size**: Configurable (default: 60K records per use case)


In [None]:
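# Import utility functions and required libraries
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, parse_json

# Reuse the notebook's active session if one exists; otherwise create one
spark = SparkSession.builder.getOrCreate()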

# Import all data generation utilities
from data_utility import (
    generate_ecommerce_data,
    generate_oil_rig_data, 
    generate_security_data
)

# Import individual use case runners
from ecommerce_event_analytics import run_ecommerce_analysis
from iot_sensor_processing import run_iot_analysis  
from security_log_analysis import run_security_analysis

print("✅ All utility functions imported successfully")
print(f"Spark Version: {spark.version}")
print(f"Available Spark session: {spark}")


## 🛒 Use Case 1: E-commerce Event Analytics

Analyze heterogeneous e-commerce events (purchases, searches, wishlists) using the Variant data type for unified processing.

### Key Analysis Areas:
- Event type distribution and patterns
- Purchase analysis by category and revenue
- Search behavior and user preferences
- User engagement across event types
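
A purchase-by-category analysis of this kind could be written as a CTE-based query along the following lines. This is a sketch under stated assumptions, not the query used by `run_ecommerce_analysis`: the view name `ecommerce_events`, the Variant column `event_data`, and the JSON fields `event_type`, `category`, and `amount` are all hypothetical.

```python
def purchase_revenue_sql(view_name: str = "ecommerce_events",
                         variant_col: str = "event_data") -> str:
    """Build a CTE-based query that totals purchase revenue per category.

    The view, column, and JSON field names are assumptions for illustration.
    Identifiers are interpolated directly, so pass only trusted names.
    """
    return f"""
WITH purchases AS (
    SELECT
        variant_get({variant_col}, '$.category', 'string') AS category,
        try_variant_get({variant_col}, '$.amount', 'double') AS amount
    FROM {view_name}
    WHERE variant_get({variant_col}, '$.event_type', 'string') = 'purchase'
)
SELECT category,
       COUNT(*) AS purchases,
       ROUND(SUM(amount), 2) AS revenue
FROM purchases
GROUP BY category
ORDER BY revenue DESC
""".strip()


query = purchase_revenue_sql()
# With a live session and a registered view: spark.sql(query).show()
```

The CTE extracts the Variant fields once, so the outer aggregation works on ordinary typed columns.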


In [None]:
# Run E-commerce Analysis
# Adjust num_records as needed (default: 60000 for full analysis, 10000 for quick demo)
run_ecommerce_analysis(num_records=10000)


## 🏭 Use Case 2: IoT Sensor Processing

Process diverse IoT sensor data from oil rig operations with different sensor types and measurement structures.

### Key Analysis Areas:
- Sensor type distribution and health monitoring
- Critical alerts and anomaly detection
- Location-based analysis and trends
- Equipment monitoring and maintenance insights
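
Critical-alert detection over mixed sensor shapes might look like the DataFrame API sketch below. It assumes a DataFrame with a Variant column named `sensor_data`; the sample readings, the JSON paths, and the threshold values are hypothetical and do not come from `generate_oil_rig_data`.

```python
import json
from functools import reduce

# Hypothetical readings: each sensor type carries its own measurement fields.
SAMPLE_READINGS = [
    json.dumps({"sensor_type": "pressure", "rig": "RIG-A", "psi": 3150.0}),
    json.dumps({"sensor_type": "temperature", "rig": "RIG-B", "celsius": 96.5}),
    json.dumps({"sensor_type": "vibration", "rig": "RIG-A", "rms_mm_s": 2.1}),
]

# Assumed alert thresholds keyed by JSON path; real limits depend on equipment.
CRITICAL_LIMITS = {"$.psi": 3000.0, "$.celsius": 90.0}


def critical_readings(df, variant_col="sensor_data"):
    """Keep rows where any known measurement exceeds its assumed threshold."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import col, try_variant_get

    checks = [
        try_variant_get(col(variant_col), path, "double") > limit
        for path, limit in CRITICAL_LIMITS.items()
    ]
    # A path missing from a row yields NULL; filter() treats a NULL
    # condition as false, so only rows with a real exceedance are kept.
    return df.filter(reduce(lambda a, b: a | b, checks))
```

Because each check reads a different path, one filter covers every sensor type without a per-type schema.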


In [None]:
# Run IoT Sensor Analysis
# Adjust num_records as needed (default: 60000 for full analysis, 10000 for quick demo)
run_iot_analysis(num_records=10000)


## 🛡️ Use Case 3: Security Log Analysis

Analyze heterogeneous security logs from multiple systems (Firewall, Antivirus, IDS) for comprehensive threat detection.

### Key Analysis Areas:
- Multi-system security event correlation
- Geographic threat intelligence
- Severity-based threat prioritization
- Cross-system IP correlation for advanced threat detection
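
Cross-system IP correlation can be sketched as a query that counts how many distinct systems reported each source IP. The view name `security_logs`, the Variant column `log_data`, and the JSON fields `source_system` and `source_ip` are assumptions for illustration; the actual field names in `security_log_analysis` may differ.

```python
def cross_system_ip_sql(view_name: str = "security_logs",
                        variant_col: str = "log_data",
                        min_systems: int = 2) -> str:
    """Build a query listing IPs reported by at least `min_systems` systems.

    All view, column, and JSON field names here are hypothetical.
    Identifiers are interpolated directly, so pass only trusted names.
    """
    return f"""
WITH events AS (
    SELECT
        variant_get({variant_col}, '$.source_system', 'string') AS source_system,
        variant_get({variant_col}, '$.source_ip', 'string') AS source_ip
    FROM {view_name}
)
SELECT source_ip,
       COUNT(DISTINCT source_system) AS systems_seen,
       COUNT(*) AS total_events
FROM events
WHERE source_ip IS NOT NULL
GROUP BY source_ip
HAVING COUNT(DISTINCT source_system) >= {int(min_systems)}
ORDER BY systems_seen DESC, total_events DESC
""".strip()


correlation_query = cross_system_ip_sql()
# With a live session and a registered view: spark.sql(correlation_query).show()
```

An IP that appears in firewall, antivirus, and IDS logs alike ranks above one seen by a single system, which is the prioritization the list above describes.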

**Note**: All security analysis patterns have been validated against real-world SIEM practices.


In [None]:
# Run Security Log Analysis
# Adjust num_records as needed (default: 60000 for full analysis, 10000 for quick demo)
run_security_analysis(num_records=10000)


## 🎯 Interactive Use Case Runner

Run specific use cases with custom parameters or run all use cases for a comprehensive demo.


In [None]:
# Interactive runner - uncomment and customize as needed

# Option 1: Run specific use case with custom parameters
# run_ecommerce_analysis(num_records=25000)
# run_iot_analysis(num_records=15000)
# run_security_analysis(num_records=30000)

# Option 2: Quick demo of all use cases
print("🚀 Running quick demo of all use cases...")
print("\n" + "="*80)
print("QUICK DEMO: ALL SPARK VARIANT USE CASES")
print("="*80)

# Run all with smaller datasets for quick demonstration
run_ecommerce_analysis(5000)
print("\n" + "="*80)
run_iot_analysis(5000)
print("\n" + "="*80) 
run_security_analysis(5000)

print("\n" + "="*80)
print("✅ ALL USE CASES COMPLETED SUCCESSFULLY!")
print("="*80)


## 🔧 Available Utility Functions Reference

### Data Generation Functions:

```python
# Main data generation functions (imported from data_utility)
generate_ecommerce_data(num_records=1000)
generate_oil_rig_data(num_records=1000) 
generate_security_data(num_records=1000)
```

### Analysis Functions:

```python
# Complete analysis runners (imported from respective modules)
run_ecommerce_analysis(num_records=60000)
run_iot_analysis(num_records=60000)
run_security_analysis(num_records=60000)
```

### Key Spark Variant Functions:

```sql
-- Convert JSON string to Variant
parse_json(column_name)

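-- Extract the value at a JSON path and cast it to a Spark SQL type
VARIANT_GET(variant_column, '$.path', 'type')

-- 'type' is a Spark SQL type name, e.g. 'string', 'int', 'double', 'boolean'.
-- Nested arrays and objects can be returned as 'variant' or cast to a
-- concrete type such as 'array<string>'.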
```

### File Structure:
```
variants/
├── data_utility.py              # All data generation utilities
├── ecommerce_event_analytics.py # E-commerce analysis
├── iot_sensor_processing.py     # IoT sensor analysis  
├── security_log_analysis.py     # Security log analysis
└── spark_variant_analysis_notebook.ipynb # This notebook
```
