# NEMWEB Custom Data Source Demo

This notebook demonstrates how to use the NEMWEB custom PySpark data source from the `src` folder.

**Requirements:**
- Databricks Runtime 15.4+ or Serverless (Environment Version 4)
- Python Data Source API (GA in Spark 4.0)

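**What this notebook covers:**
1. Installing the package from the src folder
2. Registering the custom data source
3. Batch reading from the NEMWEB API
4. Reading multiple regions in parallel
5. Analyzing the data
6. Saving to a Delta table (optional)
7. Multi-day reads with checkpointing (advanced)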

## 1. Setup - Install Package from src Folder

In [None]:
# Install the nemweb package from the src folder in editable mode
# This makes the nemweb_datasource module available for import
# Replace {your-username} with your workspace username before running
%pip install -e /Workspace/Repos/{your-username}/databricks-nemweb-lab/databricks-nemweb-lab/src --quiet

In [None]:
# Restart Python to pick up the installed package
dbutils.library.restartPython()

## 2. Register the Custom Data Source

In [None]:
from databricks.sdk.runtime import spark, dbutils
from datetime import datetime, timedelta

# Import the custom data source
from nemweb_datasource import NemwebDataSource

# Register with Spark - this enables spark.read.format("nemweb")
spark.dataSource.register(NemwebDataSource)

print("NEMWEB data source registered successfully!")
print("You can now use: spark.read.format('nemweb')...")

## 3. Batch Read - Fetch Data from NEMWEB API

The NEMWEB data source fetches real electricity market data from AEMO's public API.

**Available options:**
- `table`: MMS table name (default: `DISPATCHREGIONSUM`)
- `regions`: Comma-separated NEM regions (default: `NSW1,QLD1,SA1,VIC1,TAS1`)
- `start_date`: Start date in `YYYY-MM-DD` format
- `end_date`: End date in `YYYY-MM-DD` format
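
As a rough illustration of how these options fit together, the sketch below normalises an options dictionary into typed values. The helper `parse_nemweb_options` is hypothetical and is not taken from the `nemweb_datasource` package. Only the option names and defaults come from the list above. The validation rules are assumptions.

```python
from datetime import date

# Defaults copied from the option list above; the parsing logic is an assumption.
DEFAULT_TABLE = "DISPATCHREGIONSUM"
DEFAULT_REGIONS = "NSW1,QLD1,SA1,VIC1,TAS1"


def parse_nemweb_options(options: dict) -> dict:
    """Hypothetical helper: turn string reader options into typed values."""
    raw_regions = options.get("regions", DEFAULT_REGIONS)
    # Split the comma-separated list, tolerate stray spaces and lower case
    regions = [r.strip().upper() for r in raw_regions.split(",") if r.strip()]
    # Dates are expected in YYYY-MM-DD format
    start = date.fromisoformat(options["start_date"])
    end = date.fromisoformat(options["end_date"])
    if end < start:
        raise ValueError(f"end_date {end} is before start_date {start}")
    return {
        "table": options.get("table", DEFAULT_TABLE),
        "regions": regions,
        "start_date": start,
        "end_date": end,
    }


parsed = parse_nemweb_options(
    {"regions": "NSW1, vic1", "start_date": "2024-01-01", "end_date": "2024-01-02"}
)
print(parsed["table"], parsed["regions"])
```

Spark passes every option to a data source as a string, so converting dates and splitting the region list up front keeps later logic simple.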

In [None]:
# Use yesterday's date (guaranteed to exist in NEMWEB CURRENT folder)
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Read a single region for a single day (fast demo)
df = (spark.read
      .format("nemweb")
      .option("table", "DISPATCHREGIONSUM")
      .option("regions", "NSW1")
      .option("start_date", yesterday)
      .option("end_date", yesterday)
      .load())

print(f"Fetched {df.count()} rows for NSW1 on {yesterday}")
print("Expected: ~288 rows (24 hours × 12 intervals/hour at 5-min granularity)")

In [None]:
# Display the schema
df.printSchema()

In [None]:
# Display sample data
display(df.orderBy("SETTLEMENTDATE").limit(20))

## 4. Read Multiple Regions

For a single-day read, the data source creates one partition per region, so the five regions below can be fetched in parallel.
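
To make the partitioning idea concrete, here is a minimal sketch of how a reader could plan its work as (region, date) pairs, which matches the description in section 7. The function `plan_partitions` is hypothetical and is not the package's actual implementation. The figure of 288 rows per region per day is the approximate value quoted in section 3.

```python
from datetime import date, timedelta
from itertools import product

INTERVALS_PER_DAY = 24 * 12  # 5-minute dispatch intervals, as noted in section 3


def plan_partitions(regions, start, end):
    """Hypothetical planner: one unit of work per (region, date) pair."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return list(product(regions, days))


# Single day, five regions: one partition per region
single_day = plan_partitions(
    ["NSW1", "QLD1", "SA1", "VIC1", "TAS1"], date(2024, 1, 1), date(2024, 1, 1)
)
print(len(single_day))  # 5

# Seven days, two regions: one partition per (region, date) pair
week = plan_partitions(["NSW1", "VIC1"], date(2024, 1, 1), date(2024, 1, 7))
print(len(week))  # 14
print(len(week) * INTERVALS_PER_DAY)  # approximate expected row count: 4032
```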

In [None]:
# Read all 5 NEM regions
df_all_regions = (spark.read
                  .format("nemweb")
                  .option("regions", "NSW1,QLD1,SA1,VIC1,TAS1")
                  .option("start_date", yesterday)
                  .option("end_date", yesterday)
                  .load())

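from pyspark.sql.functions import spark_partition_id

print(f"Total rows: {df_all_regions.count()}")
# df.rdd is not available on Serverless compute, so count distinct partition IDs instead.
# This counts non-empty partitions and triggers another read from the source.
num_partitions = df_all_regions.select(spark_partition_id()).distinct().count()
print(f"Partitions: {num_partitions}")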

In [None]:
# Check row counts by region
display(
    df_all_regions
    .groupBy("REGIONID")
    .count()
    .orderBy("REGIONID")
)

## 5. Analyze the Data

Run some basic analytics on the electricity market data.

In [None]:
from pyspark.sql.functions import avg, col
# Aliased so the Spark functions do not shadow Python's built-in max and min
from pyspark.sql.functions import max as max_, min as min_

# Regional demand summary
summary = (df_all_regions
           .groupBy("REGIONID")
           .agg(
               avg("TOTALDEMAND").alias("avg_demand_mw"),
               max_("TOTALDEMAND").alias("peak_demand_mw"),
               min_("TOTALDEMAND").alias("min_demand_mw"),
               avg("AVAILABLEGENERATION").alias("avg_generation_mw")
           )
           .orderBy(col("avg_demand_mw").desc()))

display(summary)

In [None]:
from pyspark.sql.functions import avg, col, hour

# Hourly demand pattern for NSW
hourly_demand = (df_all_regions
                 .filter(col("REGIONID") == "NSW1")
                 .withColumn("hour", hour("SETTLEMENTDATE"))
                 .groupBy("hour")
                 .agg(avg("TOTALDEMAND").alias("avg_demand_mw"))
                 .orderBy("hour"))

display(hourly_demand)

## 6. Save to Delta Table (Optional)

Persist the data to a Delta table for downstream analytics.

In [None]:
# Uncomment to save to a Delta table
# catalog = "main"
# schema = "nemweb_lab"
# table = "dispatch_region_sum"

# # Create schema if not exists
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

# # Write to Delta table
# (df_all_regions
#  .write
#  .mode("append")
#  .saveAsTable(f"{catalog}.{schema}.{table}"))

# print(f"Data saved to {catalog}.{schema}.{table}")

## 7. Multi-Day Read with Checkpointing (Advanced)

For production workloads, use checkpointing to track progress and enable resumability.
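
The idea behind checkpointing can be sketched in plain Python: record each partition once it has been fetched, and skip recorded partitions on the next run. This is only an illustration under assumed names. `run_with_checkpoints` and `fake_fetch` are hypothetical, and an in-memory set stands in for the checkpoint table. How the package actually stores and reads checkpoints is not shown here.

```python
def run_with_checkpoints(partitions, fetch, completed):
    """Hypothetical loop: fetch each partition once, recording progress as we go."""
    for partition in partitions:
        if partition in completed:
            continue  # already fetched in an earlier run
        fetch(partition)
        completed.add(partition)  # record progress only after a successful fetch


partitions = [("NSW1", "2024-01-01"), ("NSW1", "2024-01-02"), ("VIC1", "2024-01-01")]
completed = set()  # stands in for the checkpoint table
fail_once = {("NSW1", "2024-01-02")}  # simulate one transient failure
fetched = []


def fake_fetch(partition):
    if partition in fail_once:
        fail_once.discard(partition)
        raise ConnectionError(f"simulated failure for {partition}")
    fetched.append(partition)


try:
    run_with_checkpoints(partitions, fake_fetch, completed)
except ConnectionError as exc:
    print(f"First run stopped: {exc}")

# The second run resumes after the last recorded partition
run_with_checkpoints(partitions, fake_fetch, completed)
print(fetched)
```

Because progress is recorded only after a successful fetch, the partition that failed is retried on the second run, and the one that already succeeded is not fetched again.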

In [None]:
# Read a week of data with checkpoint tracking
# This creates partitions for each (region, date) combination

# week_ago = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d")

# df_week = (spark.read
#            .format("nemweb")
#            .option("regions", "NSW1,VIC1")
#            .option("start_date", week_ago)
#            .option("end_date", yesterday)
#            .option("checkpoint_table", "main.nemweb_lab.checkpoints")  # Enable checkpointing
#            .load())

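# # df.rdd is not available on Serverless compute, so count distinct partition IDs instead
# from pyspark.sql.functions import spark_partition_id
# print(f"Partitions: {df_week.select(spark_partition_id()).distinct().count()}")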
# print(f"Total rows: {df_week.count()}")

## Summary

You've successfully:
1. Installed the NEMWEB package from the src folder
2. Registered the custom data source with Spark
3. Fetched real electricity market data from the NEMWEB API
4. Analyzed demand patterns across NEM regions

**Next Steps:**
- Explore the Lakeflow Pipeline notebook to build a bronze/silver/gold medallion architecture
- Check out the streaming capabilities for real-time ingestion
- Review `nemweb_utils.py` for retry logic and error handling patterns