# 00 Environment Setup (Local Development)

This notebook configures the environment for **local development only**. It initializes Spark, loads utilities, and provides cross-platform compatibility helpers.

## Usage

**⚠️ IMPORTANT: This notebook is for LOCAL DEVELOPMENT ONLY**

### In Local Jupyter/VS Code
```python
# Run the setup notebook from the first cell of your own notebook
%run 00_Environment_Setup.ipynb
```

### In Databricks
**DO NOT RUN THIS NOTEBOOK**
- Databricks provides a pre-configured `spark` session and a real `dbutils`
- All necessary libraries are already available
- Import what you need directly in your notebook
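
If a notebook has to run in both places, one way to decide whether to call this setup is to look for the `DATABRICKS_RUNTIME_VERSION` environment variable, which the Databricks runtime sets on its clusters. The helper name `running_on_databricks` below is hypothetical and not part of this project's `utils` module; treat it as a minimal sketch.

```python
import os


def running_on_databricks(env=None):
    """Return True when the Databricks runtime marker variable is present.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without a cluster.
    """
    env = os.environ if env is None else env
    return "DATABRICKS_RUNTIME_VERSION" in env


if running_on_databricks():
    print("Databricks detected: skip 00_Environment_Setup and use the built-in spark.")
else:
    print("Local environment detected: run 00_Environment_Setup first.")
```

The check only reads an environment variable, so it is safe to run before any Spark import.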

## What This Notebook Provides

After running this setup notebook in a local environment, you'll have access to:
- `spark` - Configured SparkSession for local development
- `dbutils` - Mock dbutils for local compatibility
- `get_storage_path()` - Platform-aware path helper
- `F` - PySpark functions (pyspark.sql.functions)
- All types from `pyspark.sql.types` (wildcard import)

In [None]:
# Environment Setup - Local Development Configuration
# This notebook is for LOCAL DEVELOPMENT ONLY

import sys
from pathlib import Path

# Add utils to Python path for local development
utils_path = Path("../utils") if Path("../utils").exists() else Path("./utils")
if utils_path.exists() and str(utils_path.parent.resolve()) not in sys.path:
    sys.path.insert(0, str(utils_path.parent.resolve()))

# Import utilities
try:
    from utils.spark_utils import (
        get_spark_session,
        get_dbutils_mock,
        get_storage_path
    )
    _utils_available = True
except ImportError:
    print("⚠️  Warning: utils module not found. Some features may not be available.")
    print("   Make sure utils/ directory exists and contains spark_utils.py")
    _utils_available = False

# Import PySpark essentials
import pyspark.sql.functions as F
from pyspark.sql.types import *

print("✓ PySpark imports loaded")

In [None]:
# Initialize Spark Session for Local Development
# Creates a new SparkSession configured for local use

if _utils_available:
    # Create local SparkSession without Delta due to version compatibility
    # Note: Enable Delta if you have compatible delta-spark installed
    spark = get_spark_session(app_name="PySpark_Best_Practices", enable_delta=False)
    print("✓ Created local SparkSession")
    print("   Note: Delta Lake disabled by default for compatibility")
    print("   To enable: Install matching delta-spark version and set enable_delta=True")
else:
    # Fallback if utils not available
    try:
        spark  # Raises NameError if no session has been defined yet
        print("✓ Using existing SparkSession")
    except NameError:
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("PySpark_Best_Practices").getOrCreate()
        print("✓ Created basic SparkSession")

In [None]:
# Initialize dbutils for Local Development
# Provides mock dbutils for compatibility with Databricks code

if _utils_available:
    dbutils = get_dbutils_mock()
    print("✓ Using mock dbutils for local development")
else:
    # Fallback: no dbutils available
    print("ℹ️  dbutils not available (utils module not found)")
    dbutils = None
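
The implementation of `get_dbutils_mock()` is not shown in this notebook. As a rough illustration of the idea only, a mock can expose the same attribute paths that Databricks code calls (`dbutils.widgets.get`, `dbutils.fs.ls`) and back them with local equivalents. Everything below, including the name `make_dbutils_sketch` and the choice to return plain file names from `fs.ls`, is an assumption made for this sketch and is not the project's actual mock.

```python
import os
from types import SimpleNamespace


def make_dbutils_sketch(widget_defaults=None):
    """Build a tiny stand-in object with dbutils-like attribute paths.

    Hypothetical helper: real dbutils returns richer objects from fs.ls;
    this sketch returns a sorted list of names for simplicity.
    """
    widgets = dict(widget_defaults or {})

    def widgets_get(name):
        # Look up a widget value that was supplied as a local default
        return widgets[name]

    def fs_ls(path):
        # List a local directory instead of a DBFS location
        return sorted(os.listdir(path))

    return SimpleNamespace(
        widgets=SimpleNamespace(get=widgets_get),
        fs=SimpleNamespace(ls=fs_ls),
    )


demo_dbutils = make_dbutils_sketch({"env": "dev"})
print(demo_dbutils.widgets.get("env"))
print(demo_dbutils.fs.ls("."))
```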

In [None]:
# Display Environment Information

print("="*60)
print("LOCAL ENVIRONMENT SETUP COMPLETE")
print("="*60)

# Platform information
print("\n📍 Platform: Local Python")

# Spark information
print("\n⚡ Spark Configuration:")
print(f"   Version: {spark.version}")
print(f"   App Name: {spark.sparkContext.appName}")

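# Check whether the delta-spark package is importable
# (this does not confirm that the current session has Delta enabled)
try:
    from delta import DeltaTable
    print("   Delta Lake: ✓ Package installed")
except ImportError:
    print("   Delta Lake: ✗ Package not installed")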

# Python information  
print(f"\n🐍 Python Version: {sys.version.split()[0]}")

# Available utilities
print("\n🔧 Available Utilities:")
if _utils_available:
    print("   - get_storage_path(): Get platform-aware storage paths")
    print("   - get_spark_session(): Create configured Spark session")
    print("   - dbutils: Mock implementation for local development")
else:
    print("   - Basic setup only (utils module not found)")

print("\n📚 Common Imports:")
print("   - pyspark.sql.functions as F")
print("   - pyspark.sql.types (all types)")

print("\n" + "="*60)
print("Ready to run PySpark code!")
print("="*60)

## Quick Start Examples

After running the setup above, you can immediately start using PySpark:

In [None]:
# Example: Create a simple DataFrame
sample_data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
sample_df = spark.createDataFrame(sample_data, ["name", "age"])

print("Sample DataFrame:")
sample_df.show()

# Example: Use PySpark functions
result = sample_df.withColumn("age_plus_10", F.col("age") + 10)
result.show()

## Platform-Specific Features

### Storage Paths

Use `get_storage_path()` to get platform-appropriate paths:

In [None]:
# Get platform-appropriate storage path
if _utils_available:
    example_path = get_storage_path("my_table", format_type="delta")
    print(f"Storage path for 'my_table': {example_path}")
    
    # Example: Save DataFrame with platform-aware path
    # sample_df.write.format("delta").mode("overwrite").save(example_path)
else:
    print("Storage path utilities not available (utils module not found)")


## Troubleshooting

### If the utils module is not found:
1. Ensure the `utils/` directory is in your project root
2. Check that `utils/spark_utils.py` exists
3. Verify the notebook's location: the setup cell only looks for `utils/` in the working directory (`./utils`) and one level above it (`../utils`)
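
These checks can be run from a notebook cell. The sketch below mirrors the two locations the setup cell probes (`../utils` and `./utils`) and reports whether `spark_utils.py` is present in either; the helper name `find_utils_dir` is hypothetical.

```python
from pathlib import Path


def find_utils_dir(start=None, candidates=("../utils", "./utils")):
    """Return the first candidate directory that contains spark_utils.py.

    `start` defaults to the current working directory, which is normally
    the folder the notebook was launched from.
    """
    start = Path.cwd() if start is None else Path(start)
    for rel in candidates:
        candidate = (start / rel).resolve()
        if (candidate / "spark_utils.py").is_file():
            return candidate
    return None


found = find_utils_dir()
print(f"Working directory: {Path.cwd()}")
print(f"utils directory:   {found if found else 'not found'}")
```

If the result is `not found`, moving the notebook or changing the working directory is usually simpler than editing `sys.path` by hand.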

### If Delta Lake is not available:
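- **Local**: Install with `pip install delta-spark`, choosing a release that matches your installed PySpark version, then pass `enable_delta=True` to `get_spark_session()`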
- **Databricks**: Delta is pre-installed in Runtime 12.2+
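
Before enabling Delta locally, it can help to see which versions are actually installed. The sketch below uses only the standard library's `importlib.metadata`; it reports versions and leaves the compatibility decision to you, since the supported pairings are listed in the Delta Lake release notes. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("pyspark", "delta-spark"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```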

### If the Spark session fails to start:
- **Local**: Ensure Java 11 or 17 is installed (`java -version`)
- Check that `JAVA_HOME` is set and points to that Java installation
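
Both Java checks can be done from Python without leaving the notebook. The sketch below collects `JAVA_HOME`, the `java` executable found on `PATH`, and the output of `java -version` (which Java writes to stderr). The function name `java_diagnostics` is hypothetical.

```python
import os
import shutil
import subprocess


def java_diagnostics():
    """Collect basic facts about the Java installation visible to Python."""
    java_path = shutil.which("java")
    info = {
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "java_on_path": java_path,
        "version_output": None,
    }
    if java_path:
        result = subprocess.run(
            [java_path, "-version"], capture_output=True, text=True
        )
        # `java -version` prints to stderr, so fall back to stdout only if empty
        info["version_output"] = (result.stderr or result.stdout).strip()
    return info


for key, value in java_diagnostics().items():
    print(f"{key}: {value}")
```

A `None` for `java_on_path` means Spark will not find a JVM either, regardless of what `JAVA_HOME` says.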