# Indicator Dimension Processor (Fixed)

Creates the indicator dimension table for the Philippine socioeconomic data medallion architecture.
This version seeds the dimension from a built-in list of 15 default indicators; it does not read the bronze layer Delta tables.

**Output**: `dim_indicator` with comprehensive indicator metadata and categorization

In [1]:
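import json
import os
import warnings
from datetime import datetime

from pyspark.sql import SparkSession
# Import only the names used below; a wildcard import from pyspark.sql.functions
# shadows Python builtins such as sum, min, max, and round.
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.types import (
    BooleanType, LongType, StringType, StructField, StructType, TimestampType,
)
from pyspark.sql.window import Window

warnings.filterwarnings('ignore')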

In [2]:
# Initialize Spark Session with proper Delta Lake configuration
spark = SparkSession.builder \
    .appName("IndicatorDimensionProcessor") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Application: {spark.sparkContext.appName}")
print(f"Delta Lake support: {spark.conf.get('spark.sql.extensions')}")

25/08/18 21:50:20 WARN Utils: Your hostname, 3rnese resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/08/18 21:50:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/ernese/miniconda3/envs/SO/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ernese/.ivy2/cache
The jars for the packages stored in: /home/ernese/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-413e9499-b8a8-4abe-bc70-0ea3e455f1fe;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 119ms :: artifacts dl 5ms
	:: modules in use:
	io.delta#delta-core_2.12;2.4.0 from central in [default]
	io.delta#delta-storage;2.4.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |

Spark Version: 3.4.0
Application: IndicatorDimensionProcessor
Delta Lake support: io.delta.sql.DeltaSparkSessionExtension


In [3]:
# Configuration
BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze"
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
PROCESSING_TIMESTAMP = datetime.now()

os.makedirs(SILVER_PATH, exist_ok=True)

print(f"Bronze Path: {BRONZE_PATH}")
print(f"Silver Path: {SILVER_PATH}")
print(f"Processing Time: {PROCESSING_TIMESTAMP}")

Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze
Silver Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Processing Time: 2025-08-18 21:50:23.162899
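
Both layer paths in the configuration cell are hard-coded to one machine. A minimal sketch of reading them from environment variables instead, assuming the hypothetical variable names `SO_BRONZE_PATH` and `SO_SILVER_PATH` (neither appears in the notebook) and falling back to the notebook's own values:

```python
import os

# Defaults copied from the configuration cell.
DEFAULT_BRONZE = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze"
DEFAULT_SILVER = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"


def resolve_layer_paths(env=None):
    """Return (bronze, silver), preferring overrides found in the mapping `env`.

    SO_BRONZE_PATH and SO_SILVER_PATH are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    bronze = env.get("SO_BRONZE_PATH", DEFAULT_BRONZE)
    silver = env.get("SO_SILVER_PATH", DEFAULT_SILVER)
    return bronze, silver


# Pass a plain dict to see the override without touching the real environment.
bronze, silver = resolve_layer_paths({"SO_SILVER_PATH": "/tmp/silver"})
print(f"Bronze Path: {bronze}")
print(f"Silver Path: {silver}")
```

Accepting the mapping as a parameter keeps the function easy to exercise without modifying `os.environ`.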


## Create Default Philippine Indicators

In [4]:
# Create comprehensive default Philippine indicators
def create_default_philippine_indicators():
    """Create default Philippine socioeconomic indicators"""
    indicators = [
        # Demographics
        {
            'code': 'PSA_POP_TOTAL',
            'name': 'Total Population',
            'category': 'Demographics',
            'subcategory': None,
            'unit': 'Count (Persons/Units)',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        {
            'code': 'PSA_BIRTHS_TOTAL',
            'name': 'Total Live Births',
            'category': 'VitalStatistics',
            'subcategory': 'Births',
            'unit': 'Count (Persons/Units)',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        {
            'code': 'PSA_DEATHS_TOTAL',
            'name': 'Total Deaths',
            'category': 'VitalStatistics',
            'subcategory': 'Deaths',
            'unit': 'Count (Persons/Units)',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        
        # Labor
        {
            'code': 'PSA_EMPLOYMENT_RATE',
            'name': 'Employment Rate',
            'category': 'Employment',
            'subcategory': 'Employment',
            'unit': 'Percentage',
            'frequency': 'Quarterly',
            'source': 'PSA'
        },
        {
            'code': 'PSA_UNEMPLOYMENT_RATE',
            'name': 'Unemployment Rate',
            'category': 'Employment',
            'subcategory': 'Unemployment',
            'unit': 'Percentage',
            'frequency': 'Quarterly',
            'source': 'PSA'
        },
        {
            'code': 'PSA_DAILY_WAGE',
            'name': 'Average Daily Basic Pay',
            'category': 'Wages',
            'subcategory': 'Wages',
            'unit': 'Currency (PHP/USD)',
            'frequency': 'Monthly',
            'source': 'PSA'
        },
        
        # Energy
        {
            'code': 'DOE_ENERGY_CONSUMPTION',
            'name': 'Total Final Energy Consumption',
            'category': 'EnergyConsumption',
            'subcategory': None,
            'unit': 'Energy (KWh/MWh)',
            'frequency': 'Annual',
            'source': 'DOE'
        },
        {
            'code': 'DOE_HYDROPOWER_CAPACITY',
            'name': 'Hydropower Generation Capacity',
            'category': 'Hydropower',
            'subcategory': 'Hydropower',
            'unit': 'Power (MW/KW)',
            'frequency': 'Annual',
            'source': 'DOE'
        },
        {
            'code': 'DOE_ELECTRICITY_RATES',
            'name': 'Electricity Rates',
            'category': 'EnergyPricing',
            'subcategory': None,
            'unit': 'Currency (PHP/USD)',
            'frequency': 'Monthly',
            'source': 'DOE'
        },
        
        # Environment
        {
            'code': 'PSA_CO2_EMISSIONS',
            'name': 'CO2 Emissions',
            'category': 'Emissions',
            'subcategory': 'GHGEmissions',
            'unit': 'Mass (MT/Tonnes/KG)',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        {
            'code': 'PSA_AIR_QUALITY',
            'name': 'Air Quality Concentration Levels',
            'category': 'Pollution',
            'subcategory': 'AirQuality',
            'unit': 'Unitless',
            'frequency': 'Daily',
            'source': 'PSA'
        },
        
        # Tourism
        {
            'code': 'PSA_TOURISM_EXPENDITURE',
            'name': 'Tourism Expenditure',
            'category': 'TourismExpenditure',
            'subcategory': 'Expenditure',
            'unit': 'Currency (PHP/USD)',
            'frequency': 'Quarterly',
            'source': 'PSA'
        },
        {
            'code': 'PSA_TOURISM_GVA',
            'name': 'Tourism Direct Gross Value Added',
            'category': 'TourismEconomics',
            'subcategory': None,
            'unit': 'Currency (PHP/USD)',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        
        # Income
        {
            'code': 'PSA_POVERTY_INCIDENCE',
            'name': 'Poverty Incidence',
            'category': 'Poverty',
            'subcategory': None,
            'unit': 'Percentage',
            'frequency': 'Annual',
            'source': 'PSA'
        },
        {
            'code': 'PSA_HOUSEHOLD_CONSUMPTION',
            'name': 'Household Final Consumption Expenditure',
            'category': 'Consumption',
            'subcategory': None,
            'unit': 'Currency (PHP/USD)',
            'frequency': 'Quarterly',
            'source': 'PSA'
        }
    ]
    
    standardized_indicators = []
    for ind in indicators:
        methodology = f"Standard {ind['source']} methodology applies. Refer to source documentation for detailed methodology."
        description = f"{ind['name']} (Source: {ind['source']}) - Derived from Philippine socioeconomic data collection standards."
        
        standardized_indicators.append({
            'indicator_code': ind['code'],
            'indicator_name': ind['name'],
            'indicator_description': description,
            'unit_of_measure': ind['unit'],
            'data_source': ind['source'],
            'category': ind['category'],
            'subcategory': ind['subcategory'],
            'methodology': methodology,
            'frequency': ind['frequency'],
            'source_table': 'default_philippine_data',
            'source_column': 'derived',
            'extraction_type': 'default',
            'is_active': True,
            'created_at': PROCESSING_TIMESTAMP,
            'updated_at': PROCESSING_TIMESTAMP
        })
    
    return standardized_indicators

# Create default indicators
indicator_data = create_default_philippine_indicators()
print(f"Created {len(indicator_data)} default Philippine indicators")

# Show distribution
category_counts = {}
source_counts = {}
frequency_counts = {}

for ind in indicator_data:
    category = ind['category']
    source = ind['data_source']
    frequency = ind['frequency']
    
    category_counts[category] = category_counts.get(category, 0) + 1
    source_counts[source] = source_counts.get(source, 0) + 1
    frequency_counts[frequency] = frequency_counts.get(frequency, 0) + 1

print("\nCategory distribution:")
for category, count in sorted(category_counts.items()):
    print(f"  {category}: {count:,}")

print("\nData source distribution:")
for source, count in sorted(source_counts.items()):
    print(f"  {source}: {count:,}")

print("\nFrequency distribution:")
for frequency, count in sorted(frequency_counts.items()):
    print(f"  {frequency}: {count:,}")

Created 15 default Philippine indicators

Category distribution:
  Consumption: 1
  Demographics: 1
  Emissions: 1
  Employment: 2
  EnergyConsumption: 1
  EnergyPricing: 1
  Hydropower: 1
  Pollution: 1
  Poverty: 1
  TourismEconomics: 1
  TourismExpenditure: 1
  VitalStatistics: 2
  Wages: 1

Data source distribution:
  DOE: 3
  PSA: 12

Frequency distribution:
  Annual: 8
  Daily: 1
  Monthly: 2
  Quarterly: 4
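
The three distributions above are tallied with hand-rolled dictionaries. As an alternative sketch, `collections.Counter` from the standard library produces the same counts with less bookkeeping. The records below are a three-row illustration that reuses the notebook's keys, not the full indicator list, and `tally` is a helper name introduced here:

```python
from collections import Counter

# Illustrative subset with the same keys as the standardized indicator dicts.
records = [
    {"category": "Employment", "data_source": "PSA", "frequency": "Quarterly"},
    {"category": "Employment", "data_source": "PSA", "frequency": "Quarterly"},
    {"category": "Hydropower", "data_source": "DOE", "frequency": "Annual"},
]


def tally(rows, key):
    """Count how many rows carry each distinct value of `key`."""
    return Counter(row[key] for row in rows)


by_category = tally(records, "category")
by_source = tally(records, "data_source")
by_frequency = tally(records, "frequency")

print("Category distribution:")
for name, count in sorted(by_category.items()):
    print(f"  {name}: {count:,}")
```

A `Counter` is a `dict` subclass, so it can be passed to `json.dump` in the same way as the plain dictionaries.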


## Create Indicator Dimension with Explicit Schema

In [5]:
# Build the indicator dimension DataFrame against an explicit schema
indicator_schema = StructType([
    StructField("indicator_code", StringType(), True),
    StructField("indicator_name", StringType(), True),
    StructField("indicator_description", StringType(), True),
    StructField("unit_of_measure", StringType(), True),
    StructField("data_source", StringType(), True),
    StructField("category", StringType(), True),
    StructField("subcategory", StringType(), True),
    StructField("methodology", StringType(), True),
    StructField("frequency", StringType(), True),
    StructField("source_table", StringType(), True),
    StructField("source_column", StringType(), True),
    StructField("extraction_type", StringType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("created_at", TimestampType(), True),
    StructField("updated_at", TimestampType(), True)
])

if indicator_data:
    # Ensure all values are properly typed
    clean_indicators = []
    for ind in indicator_data:
        clean_ind = {
            'indicator_code': str(ind['indicator_code']) if ind['indicator_code'] else "",
            'indicator_name': str(ind['indicator_name']) if ind['indicator_name'] else "",
            'indicator_description': str(ind['indicator_description']) if ind['indicator_description'] else "",
            'unit_of_measure': str(ind['unit_of_measure']) if ind['unit_of_measure'] else "Unitless",
            'data_source': str(ind['data_source']) if ind['data_source'] else "Unknown",
            'category': str(ind['category']) if ind['category'] else "Other",
            'subcategory': str(ind['subcategory']) if ind.get('subcategory') else None,
            'methodology': str(ind['methodology']) if ind['methodology'] else None,
            'frequency': str(ind['frequency']) if ind['frequency'] else "Annual",
            'source_table': str(ind['source_table']) if ind['source_table'] else "",
            'source_column': str(ind['source_column']) if ind['source_column'] else "",
            'extraction_type': str(ind['extraction_type']) if ind['extraction_type'] else "unknown",
            'is_active': bool(ind.get('is_active', True)),
            'created_at': ind.get('created_at'),
            'updated_at': ind.get('updated_at')
        }
        clean_indicators.append(clean_ind)
    
    # Create DataFrame with explicit schema
    indicators_df = spark.createDataFrame(clean_indicators, schema=indicator_schema)
    
    # Add indicator_id using row_number
    window_spec = Window.orderBy("indicator_code")
    indicators_df = indicators_df.withColumn("indicator_id", row_number().over(window_spec))
    
    # Select final columns in correct order per schema design
    indicators_df = indicators_df.select(
        "indicator_id", "indicator_code", "indicator_name", "indicator_description",
        "unit_of_measure", "data_source", "category", "subcategory",
        "methodology", "frequency", "is_active", "created_at", "updated_at"
    )
    
    print(f"Indicator dimension created: {indicators_df.count():,} records")
    
    # Show sample indicators by category
    print("\nSample indicators by category:")
    for category in list(category_counts.keys())[:5]:
        print(f"\n{category}:")
        sample_df = indicators_df.filter(col("category") == category).limit(2)
        for row in sample_df.collect():
            print(f"  - {row.indicator_name} [{row.unit_of_measure}] ({row.frequency})")
            
else:
    print("Creating minimal sample dimension")
    # Create sample indicators with explicit schema
    sample_indicators = [{
        'indicator_code': 'POP_TOTAL',
        'indicator_name': 'Total Population',
        'indicator_description': 'Total Population Count (Source: PSA)',
        'unit_of_measure': 'Count (Persons/Units)',
        'data_source': 'PSA',
        'category': 'Demographics',
        'subcategory': None,
        'methodology': 'Standard PSA methodology applies.',
        'frequency': 'Annual',
        'source_table': 'sample_population_table',
        'source_column': 'population_count',
        'extraction_type': 'sample',
        'is_active': True,
        'created_at': PROCESSING_TIMESTAMP,
        'updated_at': PROCESSING_TIMESTAMP
    }]
    
    indicators_df = spark.createDataFrame(sample_indicators, schema=indicator_schema)
    
    # Add indicator_id as an integer so its type matches row_number() in the main branch
    indicators_df = indicators_df.withColumn("indicator_id", lit(1).cast("int"))
    
    # Select final columns
    indicators_df = indicators_df.select(
        "indicator_id", "indicator_code", "indicator_name", "indicator_description",
        "unit_of_measure", "data_source", "category", "subcategory",
        "methodology", "frequency", "is_active", "created_at", "updated_at"
    )


Indicator dimension created: 15 records

Sample indicators by category:

Demographics:
  - Total Population [Count (Persons/Units)] (Annual)

VitalStatistics:
  - Total Live Births [Count (Persons/Units)] (Annual)
  - Total Deaths [Count (Persons/Units)] (Annual)

Employment:
  - Employment Rate [Percentage] (Quarterly)
  - Unemployment Rate [Percentage] (Quarterly)

Wages:
  - Average Daily Basic Pay [Currency (PHP/USD)] (Monthly)

EnergyConsumption:
  - Total Final Energy Consumption [Energy (KWh/MWh)] (Annual)
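
`row_number()` over `Window.orderBy("indicator_code")` numbers rows by their position in sorted code order, so the resulting IDs are dense but depend on which codes are present. The plain-Python sketch below mimics that ranking to show the consequence; it assumes ASCII codes, for which Python's string ordering should agree with Spark's default ordering, and `assign_ids` is a helper introduced here for illustration:

```python
def assign_ids(codes):
    """Mimic row_number() ordered by indicator_code: 1-based rank in sorted order."""
    return {code: rank for rank, code in enumerate(sorted(codes), start=1)}


before = assign_ids(["PSA_POP_TOTAL", "DOE_HYDROPOWER_CAPACITY", "PSA_DAILY_WAGE"])

# Adding one code that sorts first shifts the ID of every existing indicator.
after = assign_ids([
    "PSA_POP_TOTAL", "DOE_HYDROPOWER_CAPACITY", "PSA_DAILY_WAGE",
    "DOE_ELECTRICITY_RATES",
])

for code in sorted(before):
    print(f"{code}: {before[code]} -> {after[code]}")
```

If fact tables store `indicator_id`, regenerating the dimension with a changed indicator list could silently repoint those keys, so joining on `indicator_code` or persisting previously assigned IDs may be safer.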


In [6]:
# Show comprehensive sample of the dimension
print("\nIndicator Dimension Schema:")
indicators_df.printSchema()

print("\nSample Data:")
indicators_df.show(10, truncate=False)


Indicator Dimension Schema:
root
 |-- indicator_id: integer (nullable = false)
 |-- indicator_code: string (nullable = true)
 |-- indicator_name: string (nullable = true)
 |-- indicator_description: string (nullable = true)
 |-- unit_of_measure: string (nullable = true)
 |-- data_source: string (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- methodology: string (nullable = true)
 |-- frequency: string (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- updated_at: timestamp (nullable = true)


Sample Data:

## Save Indicator Dimension

In [7]:
# Save indicator dimension
dim_indicator_path = os.path.join(SILVER_PATH, "dim_indicator")

try:
    indicators_df.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save(dim_indicator_path)
    
    print("\nIndicator dimension saved successfully!")
    print(f"Path: {dim_indicator_path}")
    print(f"Records: {indicators_df.count():,}")
    
except Exception as e:
    print(f"Error saving indicator dimension: {e}")
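    # Fall back to Parquet if the Delta write fails. The validation cell reads the
    # Delta path, so it will report a failure after a fallback.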
    try:
        parquet_path = dim_indicator_path + "_parquet"
        indicators_df.write.format("parquet").mode("overwrite").save(parquet_path)
        print(f"Saved as parquet instead: {parquet_path}")
    except Exception as e2:
        print(f"Failed to save as parquet too: {e2}")
        raise



Indicator dimension saved successfully!
Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_indicator
Records: 15
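
The save cell falls back from Delta to Parquet but does not record which format was actually written. One possible shape for that logic, sketched with plain callables standing in for the Spark writers (`write_with_fallback` and both stand-in functions are hypothetical):

```python
def write_with_fallback(writers):
    """Try each (format_name, write_fn) pair in order.

    Returns the name of the first format whose write_fn completes, and raises
    RuntimeError listing every failure if none of them succeed.
    """
    errors = []
    for name, write_fn in writers:
        try:
            write_fn()
            return name
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all writers failed: " + "; ".join(errors))


written = []


def failing_delta_write():
    # Stand-in for the Delta write; always fails to exercise the fallback.
    raise OSError("delta write failed (simulated)")


def parquet_write():
    # Stand-in for the Parquet write; records that it ran.
    written.append("parquet")


used_format = write_with_fallback([
    ("delta", failing_delta_write),
    ("parquet", parquet_write),
])
print(f"Saved as: {used_format}")
```

Returning the format name lets a later step read back the same format and location that were written.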


In [8]:
# Final validation
try:
    # Validate the saved table
    test_df = spark.read.format("delta").load(dim_indicator_path)
    count = test_df.count()
    print(f"\nValidation: Successfully created dim_indicator with {count:,} records")
    
    # Show validation sample
    print("\nValidation sample by category:")
    for category in list(category_counts.keys())[:3]:
        sample_count = test_df.filter(col("category") == category).count()
        print(f"  {category}: {sample_count:,} indicators")
    
except Exception as e:
    print(f"Validation failed: {e}")


Validation: Successfully created dim_indicator with 15 records

Validation sample by category:
  Demographics: 1 indicators
  VitalStatistics: 2 indicators
  Employment: 2 indicators
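
The validation cell prints counts but never compares them with what was expected. A minimal sketch of such a comparison, operating on plain dicts (for example those produced by `row.asDict()` on collected rows); `find_problems` is a name introduced for this example:

```python
from collections import Counter


def find_problems(rows, expected_category_counts):
    """Return a list of problem descriptions; an empty list means every check passed."""
    problems = []

    # Each indicator_code should be non-empty and appear exactly once.
    code_counts = Counter(row["indicator_code"] for row in rows)
    for code, n in sorted(code_counts.items()):
        if not code:
            problems.append("empty indicator_code")
        elif n > 1:
            problems.append(f"duplicate indicator_code: {code}")

    # Per-category counts should match what was created before the save.
    actual = Counter(row["category"] for row in rows)
    for category in sorted(set(actual) | set(expected_category_counts)):
        got = actual.get(category, 0)
        want = expected_category_counts.get(category, 0)
        if got != want:
            problems.append(f"{category}: expected {want}, found {got}")
    return problems


rows = [
    {"indicator_code": "PSA_POP_TOTAL", "category": "Demographics"},
    {"indicator_code": "PSA_EMPLOYMENT_RATE", "category": "Employment"},
    {"indicator_code": "PSA_UNEMPLOYMENT_RATE", "category": "Employment"},
]
expected = {"Demographics": 1, "Employment": 2}

clean = find_problems(rows, expected)
with_duplicate = find_problems(rows + [rows[0]], expected)
print(clean)
print(with_duplicate)
```

Raising an exception when the returned list is non-empty would stop the notebook before the summary is written.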


In [9]:
# Create comprehensive summary report
summary_report = {
    'processing_timestamp': PROCESSING_TIMESTAMP.isoformat(),
    'dimension_type': 'indicator',
    'extraction_summary': {
        'indicators_created': len(indicator_data),
        'extraction_method': 'default_philippine_indicators'
    },
    'data_source_distribution': source_counts,
    'category_distribution': category_counts,
    'frequency_distribution': frequency_counts,
    'features': [
        'Comprehensive Philippine socioeconomic indicators',
        'Multi-domain coverage (demographics, labor, energy, environment, tourism, income)',
        'Standardized categorization system',
        'Unit of measure classification',
        'Frequency specification',
        'Methodology documentation',
        'Quality validation'
    ],
    'output_path': dim_indicator_path,
    'status': 'completed'
}

# Save summary
summary_path = os.path.join(SILVER_PATH, "dim_indicator_summary.json")
with open(summary_path, 'w') as f:
    json.dump(summary_report, f, indent=2, default=str)

print(f"\nSummary report saved: {summary_path}")


Summary report saved: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_indicator_summary.json
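
Each distribution in the summary report should sum to `indicators_created`. The sketch below checks that on a JSON file with the same keys as `summary_report`; the demo report is rebuilt from the counts printed earlier in the notebook and written to a temporary directory, and `inconsistent_distributions` is a hypothetical helper:

```python
import json
import os
import tempfile

DISTRIBUTION_KEYS = (
    "data_source_distribution",
    "category_distribution",
    "frequency_distribution",
)


def inconsistent_distributions(path):
    """Return the distribution keys whose counts do not sum to indicators_created."""
    with open(path) as f:
        report = json.load(f)
    total = report["extraction_summary"]["indicators_created"]
    return [key for key in DISTRIBUTION_KEYS if sum(report[key].values()) != total]


demo_report = {
    "extraction_summary": {"indicators_created": 15},
    "data_source_distribution": {"DOE": 3, "PSA": 12},
    "category_distribution": {
        "Consumption": 1, "Demographics": 1, "Emissions": 1, "Employment": 2,
        "EnergyConsumption": 1, "EnergyPricing": 1, "Hydropower": 1,
        "Pollution": 1, "Poverty": 1, "TourismEconomics": 1,
        "TourismExpenditure": 1, "VitalStatistics": 2, "Wages": 1,
    },
    "frequency_distribution": {"Annual": 8, "Daily": 1, "Monthly": 2, "Quarterly": 4},
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dim_indicator_summary.json")
    with open(demo_path, "w") as f:
        json.dump(demo_report, f, indent=2)
    mismatches = inconsistent_distributions(demo_path)

print(mismatches)
```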


In [10]:
# Summary and cleanup
print(f"\n{'='*70}")
print("INDICATOR DIMENSION PROCESSING SUMMARY")
print(f"{'='*70}")
print(f"Processing completed: {PROCESSING_TIMESTAMP}")
print(f"Indicators created: {len(indicator_data):,}")
print()
print("Data sources covered:")
for source, count in sorted(source_counts.items()):
    print(f"  {source}: {count:,} indicators")
print()
print("Categories:")
for category, count in sorted(category_counts.items()):
    print(f"  {category}: {count:,} indicators")
print()
print("Frequencies:")
for frequency, count in sorted(frequency_counts.items()):
    print(f"  {frequency}: {count:,} indicators")
print()
print(f"Output path: {dim_indicator_path}")
print("Indicator dimension ready for fact table joins!")

# Stop Spark session
spark.stop()
print("\nSpark session stopped.")


INDICATOR DIMENSION PROCESSING SUMMARY
Processing completed: 2025-08-18 21:50:23.162899
Indicators created: 15

Data sources covered:
  DOE: 3 indicators
  PSA: 12 indicators

Categories:
  Consumption: 1 indicators
  Demographics: 1 indicators
  Emissions: 1 indicators
  Employment: 2 indicators
  EnergyConsumption: 1 indicators
  EnergyPricing: 1 indicators
  Hydropower: 1 indicators
  Pollution: 1 indicators
  Poverty: 1 indicators
  TourismEconomics: 1 indicators
  TourismExpenditure: 1 indicators
  VitalStatistics: 2 indicators
  Wages: 1 indicators

Frequencies:
  Annual: 8 indicators
  Daily: 1 indicators
  Monthly: 2 indicators
  Quarterly: 4 indicators

Output path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_indicator
Indicator dimension ready for fact table joins!

Spark session stopped.
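
As a rough illustration of how a fact table might use the dimension, the sketch below resolves `indicator_code` values to `indicator_id` with a dictionary lookup and reports codes that have no match. The fact rows, their `value` field, and the IDs are placeholders invented for this example, not real statistics or the notebook's actual output:

```python
def resolve_indicator_ids(fact_rows, dim_rows):
    """Attach indicator_id to each fact row; collect codes missing from the dimension."""
    lookup = {d["indicator_code"]: d["indicator_id"] for d in dim_rows}
    resolved, unmatched = [], []
    for row in fact_rows:
        indicator_id = lookup.get(row["indicator_code"])
        if indicator_id is None:
            unmatched.append(row["indicator_code"])
        else:
            resolved.append({**row, "indicator_id": indicator_id})
    return resolved, unmatched


# Placeholder dimension and fact rows for illustration only.
dim_rows = [
    {"indicator_id": 1, "indicator_code": "PSA_EMPLOYMENT_RATE"},
    {"indicator_id": 2, "indicator_code": "PSA_POP_TOTAL"},
]
fact_rows = [
    {"indicator_code": "PSA_POP_TOTAL", "value": 100.0},
    {"indicator_code": "UNKNOWN_CODE", "value": 1.0},
]

resolved, unmatched = resolve_indicator_ids(fact_rows, dim_rows)
print(resolved)
print(unmatched)
```

In Spark the same idea would typically be a left join on `indicator_code` followed by a check for null `indicator_id` values.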
