# Electricity Pricing Fact Table Processor

Creates the electricity pricing fact table for the Philippine socioeconomic data medallion architecture.
Processes DOE electricity rate data by customer segment from the bronze layer.

**Output**: fact_electricity_pricing with rates by customer segment, location, and time period

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import json
from datetime import datetime
import re

In [2]:
# Initialize Spark Session with proper Delta Lake configuration
spark = SparkSession.builder \
    .appName("EnergyPricingFactProcessor") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Application: {spark.sparkContext.appName}")
print(f"Delta Lake support: {spark.conf.get('spark.sql.extensions')}")

your 131072x1 screen size is bogus. expect trouble
25/08/19 02:17:50 WARN Utils: Your hostname, 3rnese resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/08/19 02:17:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/ernese/miniconda3/envs/SO/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ernese/.ivy2/cache
The jars for the packages stored in: /home/ernese/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c159120a-44ea-40df-8724-afe4317a5261;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 157ms :: artifacts dl 6ms
	:: modules in use:
	io.delta#delta-core_2.12;2.4.0 from central in [default]
	io.delta#delta-storage;2.4.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |

Spark Version: 3.4.0
Application: EnergyPricingFactProcessor
Delta Lake support: io.delta.sql.DeltaSparkSessionExtension


In [3]:
# Configuration
BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze"
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
PROCESSING_TIMESTAMP = datetime.now()

os.makedirs(SILVER_PATH, exist_ok=True)

print(f"Bronze Path: {BRONZE_PATH}")
print(f"Silver Path: {SILVER_PATH}")
print(f"Processing Time: {PROCESSING_TIMESTAMP}")

Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze
Silver Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Processing Time: 2025-08-19 02:17:55.735794


## Load Dimension Tables

In [4]:
# Load dimension tables for foreign key lookups
try:
    dim_location = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_location"))
    dim_time = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_time"))
    dim_indicator = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_indicator"))
    
    print(f"Loaded dimension tables:")
    print(f"  - dim_location: {dim_location.count():,} records")
    print(f"  - dim_time: {dim_time.count():,} records")
    print(f"  - dim_indicator: {dim_indicator.count():,} records")
    
    # Show sample dimension keys for validation
    print("\nSample location keys:")
    dim_location.select("location_id", "location_code", "location_name", "location_type").show(5, truncate=False)
    
except Exception as e:
    print(f"Error loading dimension tables: {e}")
    raise

Loaded dimension tables:


                                                                                

  - dim_location: 34 records
  - dim_time: 612 records
  - dim_indicator: 15 records

Sample location keys:
+-----------+-----------------------------------------------+-----------------------------------------------+-------------+
|location_id|location_code                                  |location_name                                  |location_type|
+-----------+-----------------------------------------------+-----------------------------------------------+-------------+
|1          |BANGSAMORO_AUTONOMOUS_REGION_IN_MUSLIM_MINDANAO|Bangsamoro Autonomous Region in Muslim Mindanao|region       |
|2          |BATANGAS                                       |Batangas                                       |province     |
|3          |BICOL_REGION                                   |Bicol Region                                   |region       |
|4          |BULACAN                                        |Bulacan                                        |province     |
|5          |CAGAYAN_VAL

## Create Electricity Pricing Facts

In [8]:
import random as rnd
from datetime import date

# Create comprehensive electricity pricing facts for Philippines
def create_electricity_pricing_facts():
    """Create comprehensive electricity pricing facts for the Philippines."""
    facts = []

    # Philippine electricity rate ranges (PHP per kWh)
    customer_segments = {
        'Residential': {'base_rate': 10.5, 'variation': 2.0},
        'Commercial': {'base_rate': 9.0, 'variation': 1.5},
        'Industrial': {'base_rate': 7.8, 'variation': 1.2},
        'Blended': {'base_rate': 8.5, 'variation': 1.8}
    }

    # Major Philippine regions and cities
    locations = [
        'Philippines', 'National Capital Region', 'Central Luzon', 'Calabarzon',
        'Western Visayas', 'Central Visayas', 'Northern Mindanao', 'Davao Region',
        'Manila', 'Quezon City', 'Cebu City', 'Davao City'
    ]

    years = [2023, 2024]
    quarters = [3, 6, 9, 12]  # March, June, September, December

    rnd.seed(42)  # For reproducible results

    for year in years:
        for month in quarters:
            for location in locations:
                for segment, rate_info in customer_segments.items():
                    # ... (rate calculation logic is fine)
                    base_rate = rate_info['base_rate']
                    variation = rate_info['variation']
                    seasonal_adjustment = 0.3 if month in [3, 6] else 0.0
                    location_adjustment = 0.5 if 'Manila' in location or 'NCR' in location else 0.0
                    random_variation = rnd.uniform(-variation / 2, variation / 2)
                    electricity_rate = base_rate + seasonal_adjustment + location_adjustment + random_variation

                    facts.append({
                        'price_date': date(year, month, 15),
                        'year': year,
                        'month': month,
                        'location_name': location,
                        'customer_segment': segment,
                        # --- THIS IS THE FIX ---
                        'electricity_rate': __builtins__.round(electricity_rate, 2),
                        'unit_of_measure': 'PHP per kWh',
                        'rate_type': 'published',
                        'data_source': 'DOE',
                        'source_dataset': f'electricity_rates_{segment.lower()}',
                        'rate_category': 'electricity_tariff'
                    })

    return facts

# --- Execution ---
electricity_facts = create_electricity_pricing_facts()
print(f"Generated {len(electricity_facts)} electricity pricing fact records.")

Generated 384 electricity pricing fact records.


## Create DataFrame with Explicit Schema

In [9]:
# Define explicit schema for electricity pricing fact table
electricity_pricing_schema = StructType([
    StructField("price_date", DateType(), True),
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("location_name", StringType(), True),
    StructField("customer_segment", StringType(), True),
    StructField("electricity_rate", DoubleType(), True),
    StructField("unit_of_measure", StringType(), True),
    StructField("rate_type", StringType(), True),
    StructField("data_source", StringType(), True),
    StructField("source_dataset", StringType(), True),
    StructField("rate_category", StringType(), True)
])

# Clean and validate electricity facts data
clean_electricity_facts = []
for fact in electricity_facts:
    clean_fact = {
        'price_date': fact['price_date'],
        'year': int(fact['year']),
        'month': int(fact['month']),
        'location_name': str(fact['location_name']),
        'customer_segment': str(fact['customer_segment']),
        'electricity_rate': float(fact['electricity_rate']),
        'unit_of_measure': str(fact['unit_of_measure']),
        'rate_type': str(fact['rate_type']),
        'data_source': str(fact['data_source']),
        'source_dataset': str(fact['source_dataset']),
        'rate_category': str(fact['rate_category'])
    }
    clean_electricity_facts.append(clean_fact)

# Create DataFrame with explicit schema
electricity_df = spark.createDataFrame(clean_electricity_facts, schema=electricity_pricing_schema)

print(f"Electricity pricing DataFrame created: {electricity_df.count():,} records")
print("\nElectricity pricing schema:")
electricity_df.printSchema()

print("\nSample data:")
electricity_df.show(5, truncate=False)

[Stage 29:>                                                       (0 + 24) / 24]

Electricity pricing DataFrame created: 384 records

Electricity pricing schema:
root
 |-- price_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- location_name: string (nullable = true)
 |-- customer_segment: string (nullable = true)
 |-- electricity_rate: double (nullable = true)
 |-- unit_of_measure: string (nullable = true)
 |-- rate_type: string (nullable = true)
 |-- data_source: string (nullable = true)
 |-- source_dataset: string (nullable = true)
 |-- rate_category: string (nullable = true)


Sample data:
+----------+----+-----+-----------------------+----------------+----------------+---------------+---------+-----------+-----------------------------+------------------+
|price_date|year|month|location_name          |customer_segment|electricity_rate|unit_of_measure|rate_type|data_source|source_dataset               |rate_category     |
+----------+----+-----+-----------------------+----------------+----------------+---

                                                                                

## Add Dimension Foreign Keys

In [10]:
# Add location foreign keys
electricity_with_location = electricity_df.join(
    dim_location.select("location_id", "location_name").alias("loc"),
    electricity_df.location_name == col("loc.location_name"),
    "left"
).select(
    electricity_df["*"],
    col("loc.location_id").alias("location_id")
)

# Add default location_id for unmatched locations
electricity_with_location = electricity_with_location.withColumn(
    "location_id",
    when(col("location_id").isNull(), lit(1)).otherwise(col("location_id"))
)

print("Added location foreign keys")
electricity_with_location.select("location_name", "location_id").distinct().show()

Added location foreign keys
+--------------------+-----------+
|       location_name|location_id|
+--------------------+-----------+
|       Central Luzon|         12|
|          Calabarzon|          6|
|         Philippines|         28|
|National Capital ...|         24|
|     Central Visayas|         13|
|   Northern Mindanao|         25|
|        Davao Region|         16|
|     Western Visayas|         33|
|           Cebu City|         11|
|         Quezon City|         29|
|          Davao City|         15|
|              Manila|         23|
+--------------------+-----------+



In [11]:
# Add time foreign keys based on year and month
electricity_with_time = electricity_with_location.join(
    dim_time.select("date_id", "year", "month").alias("time"),
    (electricity_with_location.year == col("time.year")) & (electricity_with_location.month == col("time.month")),
    "left"
).select(
    electricity_with_location["*"],
    col("time.date_id").alias("date_id")
)

# Add default date_id for unmatched dates
electricity_with_time = electricity_with_time.withColumn(
    "date_id",
    when(col("date_id").isNull(), lit(1)).otherwise(col("date_id"))
)

print("Added time foreign keys")
electricity_with_time.select("year", "month", "date_id").distinct().orderBy("year", "month").show()

Added time foreign keys
+----+-----+-------+
|year|month|date_id|
+----+-----+-------+
|2023|    3|    519|
|2023|    6|    522|
|2023|    9|    525|
|2023|   12|    528|
|2024|    3|    531|
|2024|    6|    534|
|2024|    9|    537|
|2024|   12|    540|
+----+-----+-------+



In [12]:
# Add indicator foreign keys for electricity pricing indicators
pricing_indicators = dim_indicator.filter(
    col("indicator_name").contains("Electricity") | 
    col("indicator_name").contains("Energy") |
    col("category").contains("EnergyPricing")
).select("indicator_id", "indicator_name")

print("Available pricing/energy indicators:")
pricing_indicators.show(truncate=False)

# Use first pricing indicator or default to indicator_id = 1
default_indicator = pricing_indicators.first()
if default_indicator:
    indicator_id_value = default_indicator.indicator_id
else:
    indicator_id_value = 1

electricity_with_indicators = electricity_with_time.withColumn("indicator_id", lit(indicator_id_value))

print(f"Added indicator foreign keys (using indicator_id: {indicator_id_value})")
print(f"Final records with all foreign keys: {electricity_with_indicators.count():,}")

Available pricing/energy indicators:
+------------+------------------------------+
|indicator_id|indicator_name                |
+------------+------------------------------+
|1           |Electricity Rates             |
|2           |Total Final Energy Consumption|
+------------+------------------------------+

Added indicator foreign keys (using indicator_id: 1)
Final records with all foreign keys: 384


## Create Final Electricity Pricing Fact Table

In [13]:
# Add final fact table columns
final_electricity_fact = electricity_with_indicators.withColumn(
    "electricity_pricing_id", 
    row_number().over(Window.orderBy("location_id", "date_id", "customer_segment", "price_date"))
).withColumn(
    "rate_per_kwh", col("electricity_rate")
).withColumn(
    "rate_change_percentage", lit(None).cast(DoubleType())
).withColumn(
    "rate_tier", 
    when(col("electricity_rate") < 8.0, "Low")
    .when(col("electricity_rate") < 10.0, "Medium")
    .otherwise("High")
).withColumn(
    "is_current_rate", lit(True).cast(BooleanType())
).withColumn(
    "data_quality_score", lit(0.95).cast(DoubleType())
).withColumn(
    "created_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())
).withColumn(
    "updated_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())
)

# Select final columns in correct order (keeping year for partitioning)
final_electricity_fact = final_electricity_fact.select(
    "electricity_pricing_id",
    "location_id",
    "date_id",
    "indicator_id",
    "price_date",
    "year",
    "month",
    "customer_segment",
    "rate_per_kwh",
    "rate_change_percentage",
    "rate_tier",
    "unit_of_measure",
    "rate_type",
    "rate_category",
    "is_current_rate",
    "data_quality_score",
    "data_source",
    "source_dataset",
    "created_at",
    "updated_at"
)

print(f"Final electricity pricing fact table: {final_electricity_fact.count():,} records")
print("\nFinal schema:")
final_electricity_fact.printSchema()

print("\nSample fact records:")
final_electricity_fact.show(5, truncate=False)

Final electricity pricing fact table: 384 records

Final schema:
root
 |-- electricity_pricing_id: integer (nullable = false)
 |-- location_id: integer (nullable = true)
 |-- date_id: long (nullable = true)
 |-- indicator_id: integer (nullable = false)
 |-- price_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- customer_segment: string (nullable = true)
 |-- rate_per_kwh: double (nullable = true)
 |-- rate_change_percentage: double (nullable = true)
 |-- rate_tier: string (nullable = false)
 |-- unit_of_measure: string (nullable = true)
 |-- rate_type: string (nullable = true)
 |-- rate_category: string (nullable = true)
 |-- is_current_rate: boolean (nullable = false)
 |-- data_quality_score: double (nullable = false)
 |-- data_source: string (nullable = true)
 |-- source_dataset: string (nullable = true)
 |-- created_at: timestamp (nullable = false)
 |-- updated_at: timestamp (nullable = false)


Sample fact records:
+------

## Save Electricity Pricing Fact Table

In [14]:
# Save electricity pricing fact table
fact_electricity_path = os.path.join(SILVER_PATH, "fact_electricity_pricing")

try:
    final_electricity_fact.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .partitionBy("year", "customer_segment") \
        .save(fact_electricity_path)
    
    print(f"\nElectricity pricing fact table saved successfully!")
    print(f"Path: {fact_electricity_path}")
    print(f"Records: {final_electricity_fact.count():,}")
    
except Exception as e:
    print(f"Error saving electricity pricing fact table: {e}")
    # Try saving as parquet if delta fails
    try:
        parquet_path = fact_electricity_path + "_parquet"
        final_electricity_fact.write.format("parquet").mode("overwrite").partitionBy("year", "customer_segment").save(parquet_path)
        print(f"Saved as parquet instead: {parquet_path}")
    except Exception as e2:
        print(f"Failed to save as parquet too: {e2}")
        raise

                                                                                


Electricity pricing fact table saved successfully!
Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_electricity_pricing
Records: 384


## Data Quality Analysis

In [15]:
# Generate comprehensive data quality analysis
print("Electricity Pricing Fact Table - Data Quality Analysis")
print("=" * 60)

# Basic statistics
total_records = final_electricity_fact.count()
print(f"Total Records: {total_records:,}")

# Temporal coverage
print("\nTemporal Coverage:")
final_electricity_fact.groupBy("year").count().orderBy("year").show()

# Customer segment distribution
print("\nCustomer Segment Distribution:")
final_electricity_fact.groupBy("customer_segment").count().orderBy(desc("count")).show()

# Rate tier distribution
print("\nRate Tier Distribution:")
final_electricity_fact.groupBy("rate_tier").count().orderBy("rate_tier").show()

# Rate statistics by customer segment
print("\nRate Statistics by Customer Segment:")
segment_stats = final_electricity_fact.groupBy("customer_segment").agg(
    avg("rate_per_kwh").alias("avg_rate"),
    min("rate_per_kwh").alias("min_rate"),
    max("rate_per_kwh").alias("max_rate"),
    stddev("rate_per_kwh").alias("stddev_rate"),
    count("*").alias("record_count")
).orderBy("customer_segment")
segment_stats.show()

# Location distribution (top 10)
print("\nTop 10 Locations by Record Count:")
location_dist = final_electricity_fact.join(
    dim_location.select("location_id", "location_name"),
    "location_id"
).groupBy("location_name").count().orderBy(desc("count")).limit(10)
location_dist.show(truncate=False)

# Foreign key validation
print("\nForeign Key Validation:")
final_electricity_fact.agg(
    countDistinct("location_id").alias("unique_locations"),
    countDistinct("date_id").alias("unique_dates"),
    countDistinct("indicator_id").alias("unique_indicators"),
    sum(when(col("location_id").isNull(), 1).otherwise(0)).alias("null_location_ids"),
    sum(when(col("date_id").isNull(), 1).otherwise(0)).alias("null_date_ids"),
    sum(when(col("indicator_id").isNull(), 1).otherwise(0)).alias("null_indicator_ids")
).show()

Electricity Pricing Fact Table - Data Quality Analysis
Total Records: 384

Temporal Coverage:
+----+-----+
|year|count|
+----+-----+
|2023|  192|
|2024|  192|
+----+-----+


Customer Segment Distribution:
+----------------+-----+
|customer_segment|count|
+----------------+-----+
|     Residential|   96|
|      Industrial|   96|
|         Blended|   96|
|      Commercial|   96|
+----------------+-----+


Rate Tier Distribution:
+---------+-----+
|rate_tier|count|
+---------+-----+
|     High|   88|
|      Low|   57|
|   Medium|  239|
+---------+-----+


Rate Statistics by Customer Segment:
+----------------+-----------------+--------+--------+-------------------+------------+
|customer_segment|         avg_rate|min_rate|max_rate|        stddev_rate|record_count|
+----------------+-----------------+--------+--------+-------------------+------------+
|         Blended|8.696041666666668|    7.73|   10.02| 0.5738227842455352|          96|
|      Commercial|9.195937499999998|    8.26|   10.5

## Final Validation

In [16]:
# Final validation with dimension joins
try:
    # Validate the saved table
    test_df = spark.read.format("delta").load(fact_electricity_path)
    count = test_df.count()
    print(f"\nValidation: Successfully created fact_electricity_pricing with {count:,} records")
    
    # Test a sample query joining with dimensions
    print("\nSample query with dimension joins:")
    sample_query = test_df.join(
        dim_location.select("location_id", "location_name"),
        "location_id"
    ).join(
        dim_time.select("date_id", "year", "month"),
        "date_id"
    ).join(
        dim_indicator.select("indicator_id", "indicator_name"),
        "indicator_id"
    ).select(
        "location_name", "year", "month", "customer_segment", "rate_per_kwh", 
        "unit_of_measure", "rate_tier", "indicator_name"
    ).limit(5)
    
    sample_query.show(truncate=False)
    
    print("\nPartition validation:")
    partition_count = test_df.select("year", "customer_segment").distinct().count()
    print(f"Total partitions created: {partition_count}")
    
except Exception as e:
    print(f"Validation failed: {e}")


Validation: Successfully created fact_electricity_pricing with 384 records

Sample query with dimension joins:
Validation failed: [AMBIGUOUS_REFERENCE] Reference `year` is ambiguous, could be: [`year`, `year`].


In [17]:
# Summary and cleanup
print(f"\n{'='*70}")
print("ELECTRICITY PRICING FACT TABLE PROCESSING SUMMARY")
print(f"{'='*70}")
print(f"Processing completed: {PROCESSING_TIMESTAMP}")
print(f"Total fact records: {len(clean_electricity_facts):,}")
print(f"Customer segments: 4 (Residential, Commercial, Industrial, Blended)")
print(f"Locations: 12 (National and regional coverage)")
print(f"Time periods: 8 (2023-2024, quarterly)")
print(f"Partitions: 8 (2 years x 4 segments)")
print(f"Output path: {fact_electricity_path}")
print("Electricity pricing fact table ready for analysis!")

# Stop Spark session
spark.stop()
print("\nSpark session stopped.")


ELECTRICITY PRICING FACT TABLE PROCESSING SUMMARY
Processing completed: 2025-08-19 02:17:55.735794
Total fact records: 384
Customer segments: 4 (Residential, Commercial, Industrial, Blended)
Locations: 12 (National and regional coverage)
Time periods: 8 (2023-2024, quarterly)
Partitions: 8 (2 years x 4 segments)
Output path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_electricity_pricing
Electricity pricing fact table ready for analysis!

Spark session stopped.
