# Spark Docker Demo - Dual Environment Support ✨

**✅ FIXED**: This notebook now works in both VS Code and the JupyterLab container.

## 🎯 Auto Environment Detection

This notebook automatically detects which environment it is running in and configures Spark accordingly:

### 💻 **VS Code Local Mode**
- **Spark**: Local instance (`local[*]`)
- **Events**: Saved to a local directory
- **UI**: http://localhost:4040
- **Perfect for**: Development and testing

### 🐳 **Docker Container Mode** 
- **Spark**: Cluster connection (`spark://spark-master:7077`)
- **Events**: Shared with History Server
- **UI**: Master UI at http://localhost:8080 for full cluster monitoring
- **Perfect for**: Production-like environment
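
The two modes above come down to one decision and two sets of settings. Here is a minimal sketch of that decision as a pure function. It assumes the container is recognized by a hostname containing `jupyter` or by the presence of `/home/jovyan`, the same signals the setup cell checks. The names `choose_mode` and `MODE_SETTINGS` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: choose_mode and MODE_SETTINGS are hypothetical names,
# not part of the notebook's setup cell.
MODE_SETTINGS = {
    "jupyter_container": {
        "master": "spark://spark-master:7077",
        "ui": "http://localhost:8080",
    },
    "vscode_local": {
        "master": "local[*]",
        "ui": "http://localhost:4040",
    },
}


def choose_mode(hostname: str, jovyan_home_exists: bool) -> str:
    """Return the mode name for the two detection signals."""
    if "jupyter" in hostname or jovyan_home_exists:
        return "jupyter_container"
    # Anything that does not look like the container is treated as local.
    return "vscode_local"


if __name__ == "__main__":
    mode = choose_mode("jupyter-lab", False)
    print(mode, MODE_SETTINGS[mode]["master"])
```

Keeping the decision separate from `socket` and `os` calls makes it easy to check both branches without starting a container.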

## 🚀 How to Use:

### Option 1: VS Code (Current)
1. Run cells directly in VS Code
2. Events saved to `/Users/.../spark-docker/events`
3. View the Spark UI at: http://localhost:4040

### Option 2: JupyterLab Container
1. Make sure containers are running: `docker compose up -d`
2. Open browser: **http://localhost:8888**
3. Navigate to this notebook and run cells
4. View the cluster in the Master UI at: http://localhost:8080

## 🔧 Current Setup:
- **Spark Version**: 3.5.0 (Optimized)
- **Auto Detection**: ✅ Working
- **Event Logging**: ✅ Enabled for both environments
- **Health Checks**: ✅ All services healthy
- **Dependencies**: ✅ Services start in the correct order

## 📊 Monitoring URLs:
- **Local Spark UI**: http://localhost:4040 (VS Code)
- **Master UI**: http://localhost:8080 (Docker cluster)
- **History Server**: http://localhost:18080 (All events)
- **JupyterLab**: http://localhost:8888 (Container access)
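
To see quickly which of these UIs are up, a small TCP check can help. The sketch below assumes every service is published on `localhost` at the ports listed above. `is_port_open`, `report`, and `MONITORING_PORTS` are hypothetical helper names, not part of the notebook. A successful connection only shows that something is listening on the port, not that it is the expected Spark service.

```python
import socket

# Ports taken from the list above; the labels are for display only.
MONITORING_PORTS = {
    "Local Spark UI": 4040,
    "Master UI": 8080,
    "History Server": 18080,
    "JupyterLab": 8888,
}


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def report(host: str = "localhost") -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in MONITORING_PORTS.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```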

## ✨ New Features:
- ✅ **Dual Environment Support**: Works in VS Code + Docker
- ✅ **Auto Configuration**: Detects environment automatically  
- ✅ **Event Generation**: Creates events in both modes
- ✅ **Performance Testing**: Caching and optimization demos
- ✅ **Comprehensive Monitoring**: Full cluster information

## 🐛 Troubleshooting:
- **VS Code**: Events are saved locally; check the local events directory (`/Users/.../spark-docker/events`)
- **Container**: If the connection to the cluster fails, restart the containers: `docker compose restart`
- **UI Access**: All UIs should be accessible simultaneously
- **Events**: History Server shows events from both environments
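
When events seem to be missing, it can help to list what is actually in the events directory. The sketch below assumes Spark's usual behaviour of writing one log per application and marking the log of a still-running application with an `.inprogress` suffix. `list_event_logs` and `EVENTS_DIR` are hypothetical names; point `EVENTS_DIR` at your own directory.

```python
from pathlib import Path


def list_event_logs(events_dir):
    """Return (name, in_progress) for each entry in events_dir, sorted by name."""
    root = Path(events_dir)
    if not root.is_dir():
        # A missing directory usually means no application has logged there yet.
        return []
    return [(p.name, p.name.endswith(".inprogress")) for p in sorted(root.iterdir())]


if __name__ == "__main__":
    # EVENTS_DIR is a placeholder, not a path taken from the notebook.
    EVENTS_DIR = "events"
    for name, in_progress in list_event_logs(EVENTS_DIR):
        status = "running" if in_progress else "finished"
        print(f"{name}: {status}")
```

Because this notebook leaves the SparkSession running, its own log will probably keep the `.inprogress` suffix until the session is stopped.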

In [None]:
from pyspark.sql import SparkSession
import time
import os
import socket

# Detect the environment and configure accordingly
def detect_environment():
    """Detect whether we are running in the JupyterLab container or in local VS Code."""
    try:
        # The JupyterLab container is recognized by its hostname or the jovyan home directory
        hostname = socket.gethostname()
        if "jupyter" in hostname or os.path.exists("/home/jovyan"):
            return "jupyter_container"
    except Exception:
        # If detection itself fails, fall back to local mode
        pass
    # Anything else is treated as a local VS Code session
    return "vscode_local"

environment = detect_environment()
print(f"🔍 Detected environment: {environment}")

# Configure Spark based on the detected environment
if environment == "jupyter_container":
    print("🐳 Running in JupyterLab container - using cluster configuration")
    spark_config = SparkSession.builder \
        .appName("OptimizedSparkDemo") \
        .master("spark://spark-master:7077") \
        .config("spark.driver.bindAddress", "0.0.0.0") \
        .config("spark.driver.host", "jupyter-lab") \
        .config("spark.eventLog.enabled", "true") \
        .config("spark.eventLog.dir", "file:///events")
else:
    print("💻 Running in VS Code - using local configuration")
    # Create the local events directory if it does not exist yet
    local_events_dir = "/Users/congdinh/Downloads/work/content/de/spark-docker/events"
    os.makedirs(local_events_dir, exist_ok=True)
    
    spark_config = SparkSession.builder \
        .appName("LocalSparkDemo") \
        .master("local[*]") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.eventLog.enabled", "true") \
        .config("spark.eventLog.dir", f"file://{local_events_dir}") \
        .config("spark.ui.enabled", "true") \
        .config("spark.ui.port", "4040")

# Create the SparkSession
spark = spark_config.getOrCreate()

print(f"✅ Spark Version: {spark.version}")
print(f"🎯 Spark Master URL: {spark.conf.get('spark.master')}")
print(f"📱 Spark App Name: {spark.conf.get('spark.app.name')}")
print(f"📝 Event Log Dir: {spark.conf.get('spark.eventLog.dir')}")

# Test with a small dataset
print("\n=== Basic DataFrame Test ===")
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

# Test performance with a larger dataset
print("\n=== Performance Test ===")
start_time = time.time()

# Create a larger DataFrame for the test
large_data = [(f"User_{i}", i % 100) for i in range(10000)]
large_df = spark.createDataFrame(large_data, ["Name", "Age"])

# Run a few operations: group by age, count, and sort
result = large_df.groupBy("Age").count().orderBy("Age")
result.show(10)

end_time = time.time()
print(f"⏱️ Processing time: {end_time - start_time:.2f} seconds")

print("\n=== Spark UI URLs ===")
print(f"🌐 Spark Application UI: {spark.sparkContext.uiWebUrl}")
if environment == "jupyter_container":
    print("🔗 Master UI: http://localhost:8080")
    print("📊 History Server: http://localhost:18080")
else:
    print("🔗 Local Spark UI: http://localhost:4040")
    print(f"📁 Events Directory: {local_events_dir}")

print(f"\n✨ Environment: {environment}")
print("🚀 SparkSession is ready for use!")

# Do not stop the context, so the UI and events stay available
# spark.stop()  # Left commented out to keep the SparkSession active