# OLake Weather Data Analysis with Apache Spark

This notebook demonstrates querying Iceberg tables with Apache Spark after the data has been synced from MySQL to Iceberg via OLake.

## Configuration

All Spark and Iceberg configurations are loaded from mounted configuration files:
- `/opt/spark/conf/spark-defaults.conf` - Spark and Iceberg settings
- `/opt/spark/conf/core-site.xml` - Hadoop S3A configuration
- `/opt/spark/conf/catalog/iceberg.properties` - Iceberg REST catalog settings

## Step 1: Initialize Spark Session

Create the Spark session. Configuration is loaded automatically from `spark-defaults.conf`.

In [None]:
from pyspark.sql import SparkSession
import os

# Show the configuration directories Spark and Hadoop will read from
print("Configuration directories:")
print(f"  SPARK_CONF_DIR: {os.getenv('SPARK_CONF_DIR')}")
print(f"  HADOOP_CONF_DIR: {os.getenv('HADOOP_CONF_DIR')}")

# Create Spark session - configs loaded from spark-defaults.conf
spark = SparkSession.builder \
    .appName("OLake Weather Analysis") \
    .getOrCreate()

print("\n✅ Spark session initialized successfully!")
print(f"Spark version: {spark.version}")
print(f"Spark master: {spark.sparkContext.master}")
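
Printing the environment variables shows where Spark looks for configuration, but it does not confirm that the files are actually mounted. The sketch below checks the three paths listed in the Configuration section. `find_missing_files` is a hypothetical helper introduced here for illustration, not part of Spark or OLake.

```python
from pathlib import Path

# Paths taken from the Configuration section of this notebook.
EXPECTED_CONF_FILES = [
    "/opt/spark/conf/spark-defaults.conf",
    "/opt/spark/conf/core-site.xml",
    "/opt/spark/conf/catalog/iceberg.properties",
]


def find_missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


missing = find_missing_files(EXPECTED_CONF_FILES)
if missing:
    print("Missing configuration files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected configuration files are present.")
```

If any path is reported missing, the volume mounts of the Spark container are the first thing to check.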

## Step 2: Verify Configuration

Check that Iceberg catalog configuration was loaded correctly.

In [None]:
# Display loaded Spark configurations related to Iceberg
print("Loaded Iceberg Configuration:")
print("="*60)
for key, value in spark.sparkContext.getConf().getAll():
    if 'iceberg' in key.lower() or 's3' in key.lower():
        # Mask sensitive values
        if 'secret' in key.lower() or 'password' in key.lower():
            value = '***MASKED***'
        print(f"{key}: {value}")
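
The masking check above only matches keys containing `secret` or `password`, so an S3A access key would still be printed. A slightly broader variant could look like the following sketch. `mask_sensitive` and the marker list are assumptions made for illustration, and the sample values are made up.

```python
# Substrings that suggest a configuration value should not be printed.
# This list is an assumption; extend it to match your own settings.
SENSITIVE_MARKERS = ("secret", "password", "token", "access.key")


def mask_sensitive(pairs, markers=SENSITIVE_MARKERS):
    """Yield (key, value) pairs, hiding values whose key looks sensitive."""
    for key, value in pairs:
        if any(marker in key.lower() for marker in markers):
            value = "***MASKED***"
        yield key, value


# Made-up sample values, shaped like the pairs returned by getConf().getAll().
sample = [
    ("spark.hadoop.fs.s3a.access.key", "example-access-key"),
    ("spark.hadoop.fs.s3a.secret.key", "example-secret-key"),
    ("spark.sql.catalog.iceberg.type", "rest"),
]

for key, value in mask_sensitive(sample):
    print(f"{key}: {value}")
```

In the notebook, the same helper could be applied to `spark.sparkContext.getConf().getAll()` before filtering for Iceberg and S3 keys.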

## Step 3: Verify Iceberg Catalog Connection

Check available catalogs and namespaces.

In [None]:
# Show available catalogs
print("Available Catalogs:")
spark.sql("SHOW CATALOGS").show()

# Show namespaces in Iceberg catalog
print("\nNamespaces in Iceberg Catalog:")
spark.sql("SHOW NAMESPACES IN iceberg").show()

## Step 4: List Tables in Weather Database

After running the OLake sync job, you should see the `weather` table here.

In [None]:
# List tables in weather namespace
print("Tables in weather namespace:")
spark.sql("SHOW TABLES IN iceberg.weather").show()
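
If the sync job has not finished, `SHOW TABLES` returns an empty or partial list, and later cells fail with a less obvious error. A minimal sketch of a programmatic check follows. `table_exists` is a hypothetical helper, not a Spark or OLake API, and it assumes the `tableName` output column that Spark 3.x uses for `SHOW TABLES`.

```python
def table_exists(spark, namespace, table):
    """Return True if `table` is listed by SHOW TABLES IN `namespace`.

    `spark` is an active SparkSession; `namespace` is the catalog-qualified
    namespace, for example "iceberg.weather".
    """
    rows = spark.sql(f"SHOW TABLES IN {namespace}").collect()
    return any(row["tableName"] == table for row in rows)
```

In this notebook it would be called as `table_exists(spark, "iceberg.weather", "weather")`. A `False` result suggests re-running or inspecting the OLake sync job before continuing.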

## Step 5: Query Weather Data

Now let's query the actual weather data synced from MySQL!

In [None]:
# Read the weather table
weather_df = spark.table("iceberg.weather.weather")

print(f"Total records: {weather_df.count():,}")
print("\nFirst 10 rows:")
weather_df.show(10, truncate=False)

In [None]:
# Check the schema
print("Weather table schema:")
weather_df.printSchema()
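
The queries in the following steps rely on a specific set of columns. The sketch below compares those names against the columns actually present. `missing_columns` is a hypothetical helper, and the `observed` list is a made-up stand-in for `weather_df.columns`.

```python
# Column names referenced by the queries later in this notebook.
EXPECTED_COLUMNS = [
    "station_city",
    "station_state",
    "temperature_avg",
    "temperature_max",
    "temperature_min",
    "precipitation",
    "wind_speed",
    "date_year",
    "date_month",
]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# Made-up example; in the notebook, pass weather_df.columns instead.
observed = ["station_city", "station_state", "temperature_avg"]
print("Missing columns:", missing_columns(observed))
```

An empty result means every column used in Steps 6 and 7 is available.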

## Step 6: Data Analysis Examples

Perform various analytics on the weather data using Spark SQL.

In [None]:
# Average temperature by state
print("Average Temperature by State (Top 10):")
avg_temp_by_state = spark.sql("""
    SELECT 
        station_state,
        ROUND(AVG(temperature_avg), 2) as avg_temp,
        ROUND(AVG(temperature_max), 2) as avg_max_temp,
        ROUND(AVG(temperature_min), 2) as avg_min_temp,
        COUNT(*) as record_count
    FROM iceberg.weather.weather
    WHERE temperature_avg IS NOT NULL
    GROUP BY station_state
    ORDER BY avg_temp DESC
    LIMIT 10
""")

avg_temp_by_state.show()

In [None]:
# Cities with highest precipitation
print("Cities with Highest Precipitation (Top 10):")
high_precipitation = spark.sql("""
    SELECT 
        station_city,
        station_state,
        ROUND(AVG(precipitation), 2) as avg_precipitation,
        COUNT(*) as measurements
    FROM iceberg.weather.weather
    WHERE precipitation IS NOT NULL AND precipitation > 0
    GROUP BY station_city, station_state
    ORDER BY avg_precipitation DESC
    LIMIT 10
""")

high_precipitation.show(truncate=False)

In [None]:
# Temperature distribution statistics
print("Temperature Distribution Statistics:")
temp_stats = spark.sql("""
    SELECT 
        COUNT(*) as total_records,
        ROUND(MIN(temperature_min), 2) as coldest_temp,
        ROUND(MAX(temperature_max), 2) as hottest_temp,
        ROUND(AVG(temperature_avg), 2) as overall_avg_temp,
        ROUND(STDDEV(temperature_avg), 2) as temp_std_dev
    FROM iceberg.weather.weather
    WHERE temperature_avg IS NOT NULL
""")

temp_stats.show()
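
One detail worth knowing when reading `temp_std_dev`: in Spark SQL, `STDDEV` is an alias for `STDDEV_SAMP`, the sample standard deviation (divisor n - 1). `STDDEV_POP` gives the population version (divisor n). The difference can be seen on a tiny list of made-up temperatures, using only the Python standard library:

```python
import statistics

# Made-up temperatures, used only to illustrate the two formulas.
temps = [10.0, 12.0, 14.0, 16.0]

sample_sd = statistics.stdev(temps)       # divisor n - 1, like STDDEV / STDDEV_SAMP
population_sd = statistics.pstdev(temps)  # divisor n, like STDDEV_POP

print(f"Sample standard deviation:     {sample_sd:.2f}")
print(f"Population standard deviation: {population_sd:.2f}")
```

On a large table the two values are nearly identical, so the choice mainly matters for small groups.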

## Step 7: Advanced Analytics with DataFrame API

Use Spark's DataFrame API for more complex transformations.

In [None]:
from pyspark.sql.functions import col, avg, count, round as spark_round

# Monthly temperature trends by year
monthly_trends = weather_df \
    .filter(col("temperature_avg").isNotNull()) \
    .groupBy("date_year", "date_month") \
    .agg(
        spark_round(avg("temperature_avg"), 2).alias("avg_temp"),
        count("*").alias("record_count")
    ) \
    .orderBy("date_year", "date_month")

print("Monthly Temperature Trends:")
monthly_trends.show(20)

In [None]:
# Wind speed analysis by state
from pyspark.sql.functions import max as spark_max, min as spark_min

wind_analysis = weather_df \
    .filter(col("wind_speed").isNotNull()) \
    .groupBy("station_state") \
    .agg(
        spark_round(avg("wind_speed"), 2).alias("avg_wind_speed"),
        spark_round(spark_max("wind_speed"), 2).alias("max_wind_speed"),
        spark_round(spark_min("wind_speed"), 2).alias("min_wind_speed")
    ) \
    .orderBy(col("avg_wind_speed").desc())

print("Wind Speed Analysis by State (Top 10):")
wind_analysis.show(10)

## Step 8: Data Export (Optional)

Convert query results to pandas for visualization.

In [None]:
# Convert to pandas for visualization (only for small datasets)
avg_temp_pandas = avg_temp_by_state.toPandas()
print("Converted to Pandas DataFrame:")
print(avg_temp_pandas)
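
Because `toPandas()` collects every row to the driver, it can help to cap the row count before converting. A minimal sketch, assuming a hypothetical helper `to_pandas_capped` and an arbitrary default limit:

```python
def to_pandas_capped(df, max_rows=10_000):
    """Convert at most `max_rows` rows of a Spark DataFrame to pandas.

    The default limit is an arbitrary illustration; choose a value that
    fits comfortably in driver memory.
    """
    return df.limit(max_rows).toPandas()
```

For example, `to_pandas_capped(weather_df, max_rows=1000)` would bring back a bounded sample of the raw table rather than the full dataset.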

## Summary

This notebook demonstrated:
1. ✅ Loading Spark configuration from external files
2. ✅ Connecting Spark to the Iceberg REST catalog
3. ✅ Querying weather data synced from MySQL via OLake
4. ✅ Performing aggregations and analytics using Spark SQL
5. ✅ Using DataFrame API for complex transformations

### Configuration Architecture

This setup uses the following layout:
- Spark: reads its configuration files from `/opt/spark/conf`

In [None]:
# Optional: Stop the Spark session when completely done
# spark.stop()