# Economic Indicators Fact Table Processor (Fixed)

Creates the `fact_economic_indicators` fact table for the silver layer of the Philippine socioeconomic data medallion architecture.
It processes World Bank data from the `new_bronze` layer, covering:
- GDP growth, GDP per capita, GNI, tax revenue, and adjusted savings (M-series)
- CO2 and total greenhouse gas emissions (G-series)
- Natural resources rents (P-series)

**Output**: `fact_economic_indicators`, holding annual indicator values keyed by location and time period

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import json
from datetime import datetime, date
import re

In [2]:
# Initialize Spark Session with proper Delta Lake configuration
spark = SparkSession.builder \
    .appName("EconomicIndicatorsFactProcessor") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Application: {spark.sparkContext.appName}")
print(f"Delta Lake support: {spark.conf.get('spark.sql.extensions')}")

Spark Version: 3.4.0
Application: EconomicIndicatorsFactProcessor
Delta Lake support: io.delta.sql.DeltaSparkSessionExtension


In [3]:
# Configuration
BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze"
NEW_BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze/new_bronze"
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
PROCESSING_TIMESTAMP = datetime.now()

os.makedirs(SILVER_PATH, exist_ok=True)

print(f"Bronze Path: {BRONZE_PATH}")
print(f"New Bronze Path: {NEW_BRONZE_PATH}")
print(f"Silver Path: {SILVER_PATH}")
print(f"Processing Time: {PROCESSING_TIMESTAMP}")

Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze
New Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze/new_bronze
Silver Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Processing Time: 2025-08-19 02:56:32.992723
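
The paths above are hard-coded to one user's home directory. A minimal sketch of deriving them from an environment variable instead is shown here; the variable name `SO_DATA_ROOT` and the helper `resolve_paths` are hypothetical and not part of the original notebook.

```python
import os

# Default root mirrors the directory used in this notebook.
DEFAULT_ROOT = "/home/ernese/miniconda3/envs/SO/New_SO"

def resolve_paths(environ=None):
    """Build the bronze/silver paths from SO_DATA_ROOT (an assumed variable name)."""
    environ = os.environ if environ is None else environ
    root = environ.get("SO_DATA_ROOT", DEFAULT_ROOT)
    bronze = os.path.join(root, "final-spark-bronze")
    return {
        "bronze": bronze,
        "new_bronze": os.path.join(bronze, "new_bronze"),
        "silver": os.path.join(root, "final-spark-silver"),
    }

# "/data/so" is an illustrative location only.
paths = resolve_paths({"SO_DATA_ROOT": "/data/so"})
print(paths["new_bronze"])
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.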


## Load Dimension Tables

In [4]:
# Load dimension tables for foreign key lookups
try:
    dim_location = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_location"))
    dim_time = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_time"))
    dim_indicator = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_indicator"))
    
    print(f"Loaded dimension tables:")
    print(f"  - dim_location: {dim_location.count():,} records")
    print(f"  - dim_time: {dim_time.count():,} records")
    print(f"  - dim_indicator: {dim_indicator.count():,} records")
    
    # Get Philippines location_id for economic indicators
    philippines_location = dim_location.filter(col("location_name") == "Philippines").first()
    philippines_location_id = philippines_location.location_id if philippines_location else 1
    print(f"\nPhilippines location_id: {philippines_location_id}")
    
except Exception as e:
    print(f"Error loading dimension tables: {e}")
    raise

Loaded dimension tables:
  - dim_location: 34 records
  - dim_time: 612 records
  - dim_indicator: 15 records

Philippines location_id: 28
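
The lookup above falls back to `location_id = 1` when no row named "Philippines" exists, which would silently attach every fact to an unrelated location. One hedged alternative is to fail fast. The sketch below uses plain dictionaries in place of `dim_location` rows, and `find_location_id` is a hypothetical helper.

```python
def find_location_id(rows, name):
    """Return the location_id for `name`, raising if it is missing or ambiguous."""
    matches = [r["location_id"] for r in rows if r["location_name"] == name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one location named {name!r}, found {len(matches)}"
        )
    return matches[0]

# Illustrative rows only: id 28 comes from the output above, the second row is made up.
sample_rows = [
    {"location_id": 28, "location_name": "Philippines"},
    {"location_id": 3, "location_name": "Example Region"},
]
print(find_location_id(sample_rows, "Philippines"))
```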


                                                                                

## Explore Available Economic Data Sources

In [5]:
# Check available economic data in new_bronze
try:
    # List available directories in new_bronze
    import glob
    new_bronze_dirs = glob.glob(os.path.join(NEW_BRONZE_PATH, "*"))
    
    # Filter for economic indicators (M, G, P series) by directory-name prefix
    series_prefixes = ("M0", "G0", "P0")
    economic_dirs = [d for d in new_bronze_dirs
                     if os.path.isdir(d) and os.path.basename(d).startswith(series_prefixes)]
    
    print(f"Available economic indicator datasets in new_bronze ({len(economic_dirs)} total):")
    
    # Categorize by series type; match on the basename so path separators do not matter
    m_series = [d for d in economic_dirs if os.path.basename(d).startswith("M0")]  # Economic/Monetary indicators
    g_series = [d for d in economic_dirs if os.path.basename(d).startswith("G0")]  # Environmental/GHG indicators
    p_series = [d for d in economic_dirs if os.path.basename(d).startswith("P0")]  # Resource/Population indicators
    
    print(f"\nM-Series (Economic/Monetary): {len(m_series)} datasets")
    for d in m_series:
        print(f"  - {os.path.basename(d)}")
    
    print(f"\nG-Series (Environmental/GHG): {len(g_series)} datasets")
    for d in g_series:
        print(f"  - {os.path.basename(d)}")
    
    print(f"\nP-Series (Resource/Population): {len(p_series)} datasets")
    for d in p_series:
        print(f"  - {os.path.basename(d)}")
        
    # Sample first dataset to understand structure
    if economic_dirs:
        sample_dir = economic_dirs[0]
        print(f"\nSample dataset structure: {os.path.basename(sample_dir)}")
        sample_df = spark.read.parquet(sample_dir)
        print(f"Schema: {sample_df.columns}")
        print(f"Count: {sample_df.count()} records")
        sample_df.show(3, truncate=False)
        
except Exception as e:
    print(f"Error exploring economic data: {e}")
    economic_dirs = []

Available economic indicator datasets in new_bronze (9 total):

M-Series (Economic/Monetary): 5 datasets
  - M01_gdp_growth_annual_pct_1980_2024
  - M01_gni_current_us_1980_2024
  - M06_tax_revenue_as_pct_of_gdp_1980_2024
  - M01_gdp_per_capita_current_us_1980_2024
  - M07_adjusted_savings_consumption_of_fixed_capital_current_us_1980_2024

G-Series (Environmental/GHG): 3 datasets
  - G03_co2_emissions_total_excluding_lulucf_mt_co2e_1980_2024
  - G06_total_greenhouse_gas_emissions_including_lulucf_mt_co2e_1980_2024
  - G01_carbon_dioxide_co2_emissions_from_the_power_industry_energy_sector_in_millions_of_tonnes_of_co2_equivalent_1980_2024

P-Series (Resource/Population): 1 datasets
  - P05_total_natural_resources_rents_1980_2024

Sample dataset structure: M01_gdp_growth_annual_pct_1980_2024
Schema: ['indicator_id', 'indicator_value', 'country_id', 'country_value', 'country_iso3_code', 'date', 'value']
Count: 45 records
+-----------------+---------------------+----------+-------------+---
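
The series grouping in this cell can be illustrated without Spark. This is a minimal sketch, assuming directory names always begin with the series code; `group_by_series` and the `/new_bronze/` prefix are hypothetical, while the dataset names come from the listing above.

```python
import os
from collections import defaultdict

SERIES_PREFIXES = ("M0", "G0", "P0")

def group_by_series(paths):
    """Group dataset directory names by the first letter of their series code."""
    groups = defaultdict(list)
    for path in paths:
        name = os.path.basename(path.rstrip("/"))
        # Only names that start with a known series prefix are kept.
        if name.startswith(SERIES_PREFIXES):
            groups[name[0]].append(name)
    return dict(groups)

example = [
    "/new_bronze/M01_gdp_growth_annual_pct_1980_2024",
    "/new_bronze/G03_co2_emissions_total_excluding_lulucf_mt_co2e_1980_2024",
    "/new_bronze/P05_total_natural_resources_rents_1980_2024",
    "/new_bronze/X01_not_an_economic_series",  # hypothetical name, ignored
]
for series, names in sorted(group_by_series(example).items()):
    print(series, len(names))
```

Matching on the start of the basename avoids accidental hits when `M0`, `G0`, or `P0` appears elsewhere in a directory name.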

## Load and Process Economic Indicator Data (FIXED)

In [6]:
def process_economic_data_direct(data_path):
    """Process economic data directly from new_bronze (already in long format)"""
    try:
        # Read the parquet data
        df = spark.read.parquet(data_path)
        
        dataset_name = os.path.basename(data_path)
        print(f"\nProcessing: {dataset_name}")
        print(f"Original shape: {df.count()} rows, {len(df.columns)} columns")
        print(f"Columns: {df.columns}")
        
        # Show sample of the data
        print(f"Sample data from {dataset_name}:")
        df.show(3, truncate=False)
        
        # Extract series code from dataset name
        parts = dataset_name.split('_')
        series_code = parts[0] if parts else 'UNKNOWN'
        
        # Determine indicator details from dataset name
        if 'gdp_growth' in dataset_name:
            indicator_name = 'GDP Growth (annual %)'
            unit_measure = 'Percentage'
            category = 'Economic'
            subcategory = 'GDP'
        elif 'gdp_per_capita' in dataset_name:
            indicator_name = 'GDP per capita (current US$)'
            unit_measure = 'USD'
            category = 'Economic'
            subcategory = 'GDP'
        elif 'gni_current' in dataset_name:
            indicator_name = 'GNI (current US$)'
            unit_measure = 'USD'
            category = 'Economic'
            subcategory = 'GDP'
        elif 'tax_revenue' in dataset_name:
            indicator_name = 'Tax revenue (% of GDP)'
            unit_measure = 'Percentage'
            category = 'Economic'
            subcategory = 'Fiscal'
        elif 'adjusted_savings' in dataset_name:
            indicator_name = 'Adjusted savings: consumption of fixed capital'
            unit_measure = 'USD'
            category = 'Economic'
            subcategory = 'Fiscal'
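        elif 'co2_emissions' in dataset_name:
            # Both G01 (power-industry CO2) and G03 (total CO2 excluding LULUCF) match here and
            # share this indicator_name; indicator_code and source_dataset still distinguish them.
            indicator_name = 'CO2 emissions (metric tons)'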
            unit_measure = 'MT CO2'
            category = 'Environmental'
            subcategory = 'Emissions'
        elif 'greenhouse_gas' in dataset_name:
            indicator_name = 'Total greenhouse gas emissions'
            unit_measure = 'MT CO2e'
            category = 'Environmental'
            subcategory = 'Emissions'
        elif 'natural_resources_rents' in dataset_name:
            indicator_name = 'Total natural resources rents (% of GDP)'
            unit_measure = 'Percentage'
            category = 'Economic'
            subcategory = 'Natural Resources'
        else:
            # Use the indicator_value from the data if available
            first_row = df.first()
            if first_row and hasattr(first_row, 'indicator_value') and first_row.indicator_value:
                indicator_name = first_row.indicator_value
            else:
                indicator_name = f'Economic Indicator {series_code}'
            unit_measure = 'Number'
            category = 'Economic'
            subcategory = 'General'
        
        # Data is already in long format - just standardize it
        standardized_df = df.select(
            when(col("country_value").isNotNull(), col("country_value"))
            .otherwise(lit("Philippines")).alias("location_name"),
            col("date").cast(IntegerType()).alias("year"),
            col("value").cast(DoubleType()).alias("indicator_value"),
            lit(indicator_name).alias("indicator_name"),
            when(col("indicator_id").isNotNull(), col("indicator_id"))
            .otherwise(lit(series_code)).alias("indicator_code"),
            lit("World Bank").alias("data_source"),
            lit(dataset_name).alias("source_dataset"),
            lit(unit_measure).alias("unit_of_measure"),
            lit(category).alias("category"),
            lit(subcategory).alias("subcategory")
        ).filter(
            col("indicator_value").isNotNull() & 
            col("year").isNotNull() & 
            (col("year") >= 1980) & 
            (col("year") <= 2024)
        )
        
        record_count = standardized_df.count()
        print(f"  ✓ Processed: {record_count:,} records")
        
        if record_count > 0:
            print(f"  Standardized data:")
            standardized_df.show(3, truncate=False)
        
        return standardized_df
        
    except Exception as e:
        print(f"  ✗ Error processing {data_path}: {e}")
        return None

# Process all economic indicator datasets using direct processing (no transformation needed)
economic_datasets = []

print("Processing economic indicator datasets directly (no transformation needed)...")

for data_path in economic_dirs:
    processed_df = process_economic_data_direct(data_path)
    if processed_df is not None and processed_df.count() > 0:
        economic_datasets.append(processed_df)

print(f"\nSuccessfully processed {len(economic_datasets)} economic datasets")

Processing economic indicator datasets directly (no transformation needed)...

Processing: M01_gdp_growth_annual_pct_1980_2024
Original shape: 45 rows, 7 columns
Columns: ['indicator_id', 'indicator_value', 'country_id', 'country_value', 'country_iso3_code', 'date', 'value']
Sample data from M01_gdp_growth_annual_pct_1980_2024:
+-----------------+---------------------+----------+-------------+-----------------+----+-----+
|indicator_id     |indicator_value      |country_id|country_value|country_iso3_code|date|value|
+-----------------+---------------------+----------+-------------+-----------------+----+-----+
|NY.GDP.MKTP.KD.ZG|GDP growth (annual %)|PH        |Philippines  |PHL              |2024|5.7  |
|NY.GDP.MKTP.KD.ZG|GDP growth (annual %)|PH        |Philippines  |PHL              |2023|5.5  |
|NY.GDP.MKTP.KD.ZG|GDP growth (annual %)|PH        |Philippines  |PHL              |2022|7.6  |
+-----------------+---------------------+----------+-------------+-----------------+----+-----+
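
The `if`/`elif` chain in `process_economic_data_direct` amounts to an ordered substring lookup. The sketch below expresses the same rules as a table; it is an illustrative restructuring, not the notebook's code, and `INDICATOR_RULES` and `describe_dataset` are hypothetical names.

```python
# (substring, indicator_name, unit_of_measure, category, subcategory)
# Order matters: the first matching substring wins, as in the if/elif chain.
INDICATOR_RULES = [
    ("gdp_growth", "GDP Growth (annual %)", "Percentage", "Economic", "GDP"),
    ("gdp_per_capita", "GDP per capita (current US$)", "USD", "Economic", "GDP"),
    ("gni_current", "GNI (current US$)", "USD", "Economic", "GDP"),
    ("tax_revenue", "Tax revenue (% of GDP)", "Percentage", "Economic", "Fiscal"),
    ("adjusted_savings", "Adjusted savings: consumption of fixed capital",
     "USD", "Economic", "Fiscal"),
    ("co2_emissions", "CO2 emissions (metric tons)", "MT CO2",
     "Environmental", "Emissions"),
    ("greenhouse_gas", "Total greenhouse gas emissions", "MT CO2e",
     "Environmental", "Emissions"),
    ("natural_resources_rents", "Total natural resources rents (% of GDP)",
     "Percentage", "Economic", "Natural Resources"),
]

def describe_dataset(dataset_name):
    """Return indicator metadata for a dataset name, or None if no rule matches."""
    for needle, name, unit, category, subcategory in INDICATOR_RULES:
        if needle in dataset_name:
            return {
                "indicator_name": name,
                "unit_of_measure": unit,
                "category": category,
                "subcategory": subcategory,
            }
    return None  # the caller would apply the generic fallback here

print(describe_dataset("M06_tax_revenue_as_pct_of_gdp_1980_2024"))
```

Written this way, it is easier to see that the G01 and G03 datasets both hit the `co2_emissions` rule and therefore receive the same `indicator_name`.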

## Combine and Standardize Economic Data

In [7]:
if economic_datasets:
    print("Combining economic indicator datasets...")
    
    # Union all datasets by column name (every dataset shares the same standardized schema)
    combined_economic_df = economic_datasets[0]
    for df in economic_datasets[1:]:
        combined_economic_df = combined_economic_df.unionByName(df)
    
    # Add additional standardization and quality measures
    final_economic_df = combined_economic_df \
        .withColumn("location_name", 
                    when(col("location_name").isNull(), "Philippines")
                    .otherwise(trim(col("location_name")))) \
        .withColumn("month", lit(12)) \
        .withColumn("quarter", lit(4)) \
        .withColumn("measurement_date", 
                    to_date(concat(col("year"), lit("-12-31")))) \
        .withColumn("frequency", lit("annual")) \
        .withColumn("data_quality_score", 
                    when(col("indicator_value").isNotNull(), 0.95)
                    .otherwise(0.8)) \
        .withColumn("created_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())) \
        .withColumn("updated_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())) \
        .filter(col("indicator_value").isNotNull()) \
        .filter(col("year") >= 1980) \
        .filter(col("year") <= 2024)
    
    total_records = final_economic_df.count()
    print(f"Combined economic dataset: {total_records:,} records")
    
    if total_records > 0:
        # Show data distribution
        print("\nData distribution by indicator:")
        final_economic_df.groupBy("indicator_name", "category").agg(
            count("*").alias("record_count"),
            min("year").alias("min_year"),
            max("year").alias("max_year"),
            avg("indicator_value").alias("avg_value")
        ).orderBy("category", "indicator_name").show(truncate=False)
        
        print("\nSample records:")
        final_economic_df.orderBy(desc("year"), "indicator_name").show(10, truncate=False)
    else:
        print("No valid economic data after filtering")
        final_economic_df = None
        
else:
    print("No economic datasets to combine")
    final_economic_df = None

Combining economic indicator datasets...
Combined economic dataset: 364 records

Data distribution by indicator:
+----------------------------------------------+-------------+------------+--------+--------+------------------+
|indicator_name                                |category     |record_count|min_year|max_year|avg_value         |
+----------------------------------------------+-------------+------------+--------+--------+------------------+
|Adjusted savings: consumption of fixed capital|Economic     |42          |1980    |2021    |14.0              |
|GDP Growth (annual %)                         |Economic     |45          |1980    |2024    |3.853333333333332 |
|GDP per capita (current US$)                  |Economic     |45          |1980    |2024    |281.48            |
|GNI (current US$)                             |Economic     |45          |1980    |2024    |176.9111111111111 |
|Tax revenue (% of GDP)                        |Economic     |34          |1990    |2023    |13.
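
The combining step stamps each annual observation to the end of its year (month 12, quarter 4, 31 December), which is what later lets it join `dim_time` on year and month. A minimal pure-Python sketch of that derivation follows; `annual_period` is a hypothetical helper.

```python
from datetime import date

def annual_period(year):
    """Time attributes assigned to an annual observation for the given year."""
    return {
        "year": year,
        "month": 12,
        "quarter": 4,
        "measurement_date": date(year, 12, 31),
        "frequency": "annual",
    }

print(annual_period(2024)["measurement_date"].isoformat())
```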

## Add Dimension Foreign Keys

In [8]:
if final_economic_df is not None:
    # Add location foreign keys
    economic_with_location = final_economic_df.join(
        dim_location.select("location_id", "location_name").alias("loc"),
        final_economic_df.location_name == col("loc.location_name"),
        "left"
    ).select(
        final_economic_df["*"],
        col("loc.location_id").alias("location_id")
    )
    
    # Use Philippines location_id as default
    economic_with_location = economic_with_location.withColumn(
        "location_id",
        when(col("location_id").isNull(), philippines_location_id)
        .otherwise(col("location_id"))
    )
    
    print("Added location foreign keys")
    economic_with_location.select("location_name", "location_id").distinct().show()
    
    # Add time foreign keys
    economic_with_time = economic_with_location.join(
        dim_time.select("date_id", "year", "month").alias("time"),
        (economic_with_location.year == col("time.year")) & 
        (economic_with_location.month == col("time.month")),
        "left"
    ).select(
        economic_with_location["*"],
        col("time.date_id").alias("date_id")
    )
    
    # Add default date_id for unmatched dates
    economic_with_time = economic_with_time.withColumn(
        "date_id",
        when(col("date_id").isNull(), lit(1)).otherwise(col("date_id"))
    )
    
    print("Added time foreign keys")
    print(f"Records with time keys: {economic_with_time.count():,}")
    
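    # Placeholder: every row gets indicator_id = 1, so this column is not yet a real
    # foreign key into dim_indicator. Use indicator_code to identify the indicator.
    economic_with_indicators = economic_with_time.withColumn("indicator_id", lit(1))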
    
    print(f"Final records with all foreign keys: {economic_with_indicators.count():,}")
    
else:
    print("No economic data to process foreign keys")
    economic_with_indicators = None

Added location foreign keys
+-------------+-----------+
|location_name|location_id|
+-------------+-----------+
|  Philippines|         28|
+-------------+-----------+

Added time foreign keys
Records with time keys: 364
Final records with all foreign keys: 364
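
Every fact row currently receives the same `indicator_id`. A hedged sketch of resolving real surrogate keys is shown below with plain dictionaries. The schema of `dim_indicator` is not shown in this notebook, so the column names `indicator_code` and `indicator_id`, the key value 7, and the helper `assign_indicator_ids` are all assumptions.

```python
def assign_indicator_ids(facts, dim_indicator_rows):
    """Attach indicator_id by looking up each fact's indicator_code.

    Returns the resolved facts plus the sorted list of codes with no match,
    so unmatched codes are reported instead of silently defaulted.
    """
    code_to_id = {
        row["indicator_code"]: row["indicator_id"] for row in dim_indicator_rows
    }
    resolved, unmatched = [], set()
    for fact in facts:
        key = code_to_id.get(fact["indicator_code"])
        if key is None:
            unmatched.add(fact["indicator_code"])
        resolved.append({**fact, "indicator_id": key})
    return resolved, sorted(unmatched)

# Hypothetical dimension row; the code is the World Bank id seen in the sample data.
dim_rows = [{"indicator_id": 7, "indicator_code": "NY.GDP.MKTP.KD.ZG"}]
facts = [
    {"indicator_code": "NY.GDP.MKTP.KD.ZG", "year": 2024, "value": 5.7},
    {"indicator_code": "EXAMPLE.CODE", "year": 2024, "value": 1.0},  # made-up code
]
resolved, unmatched = assign_indicator_ids(facts, dim_rows)
print(unmatched)
```

In Spark the same idea would be a left join on the indicator code followed by a check for null keys.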


## Create Final Economic Indicators Fact Table

In [9]:
if economic_with_indicators is not None:
    # Create final fact table with proper columns
    final_economic_fact = economic_with_indicators.withColumn(
        "economic_indicator_id",
        row_number().over(Window.orderBy("location_id", "date_id", "indicator_code", "year"))
    ).withColumn(
        "value", col("indicator_value")
    ).withColumn(
        "value_formatted", 
        when(col("indicator_value") > 1000000000, 
             concat(round(col("indicator_value") / 1000000000, 2).cast(StringType()), lit("B")))
        .when(col("indicator_value") > 1000000, 
             concat(round(col("indicator_value") / 1000000, 2).cast(StringType()), lit("M")))
        .when(col("indicator_value") > 1000,
             concat(round(col("indicator_value") / 1000, 2).cast(StringType()), lit("K")))
        .otherwise(round(col("indicator_value"), 2).cast(StringType()))
    ).withColumn(
        "confidence_interval_lower", lit(None).cast(DoubleType())
    ).withColumn(
        "confidence_interval_upper", lit(None).cast(DoubleType())
    ).withColumn(
        "trend_direction",
        when(col("value") > lag(col("value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Increasing")
        .when(col("value") < lag(col("value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Decreasing")
        .when(col("value") == lag(col("value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Stable")
        .otherwise("Unknown")
    ).withColumn(
        "performance_rating",
        # The 5/3/0 thresholds are growth rates, so rate only the GDP growth indicator;
        # GDP per capita and GNI are US$ levels and would otherwise always be "Excellent".
        when((col("indicator_name") == "GDP Growth (annual %)") & (col("value") > 5), "Excellent")
        .when((col("indicator_name") == "GDP Growth (annual %)") & (col("value") > 3), "Good")
        .when((col("indicator_name") == "GDP Growth (annual %)") & (col("value") > 0), "Fair")
        .when((col("indicator_name") == "GDP Growth (annual %)") & (col("value") <= 0), "Poor")
        .when(col("category") == "Environmental", "Monitor")
        .otherwise("Neutral")
    )
    
    # Select final columns for fact table
    final_economic_fact = final_economic_fact.select(
        "economic_indicator_id",
        "location_id",
        "date_id",
        "indicator_id",
        "measurement_date",
        "year",
        "month",
        "quarter",
        "indicator_code",
        "indicator_name",
        "value",
        "value_formatted",
        "confidence_interval_lower",
        "confidence_interval_upper",
        "unit_of_measure",
        "category",
        "subcategory",
        "frequency",
        "trend_direction",
        "performance_rating",
        "data_quality_score",
        "data_source",
        "source_dataset",
        "created_at",
        "updated_at"
    )
    
    record_count = final_economic_fact.count()
    print(f"Final economic indicators fact table: {record_count:,} records")
    
    if record_count > 0:
        print("\nFinal schema:")
        final_economic_fact.printSchema()
        
        print("\nSample fact records:")
        final_economic_fact.orderBy(desc("year"), "indicator_name").show(5, truncate=False)
    else:
        print("No records in final fact table")
        final_economic_fact = None
        
else:
    print("No economic data available for fact table creation")
    final_economic_fact = None

Final economic indicators fact table: 364 records

Final schema:
root
 |-- economic_indicator_id: integer (nullable = false)
 |-- location_id: integer (nullable = true)
 |-- date_id: long (nullable = true)
 |-- indicator_id: integer (nullable = false)
 |-- measurement_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = false)
 |-- quarter: integer (nullable = false)
 |-- indicator_code: string (nullable = true)
 |-- indicator_name: string (nullable = false)
 |-- value: double (nullable = true)
 |-- value_formatted: string (nullable = true)
 |-- confidence_interval_lower: double (nullable = true)
 |-- confidence_interval_upper: double (nullable = true)
 |-- unit_of_measure: string (nullable = false)
 |-- category: string (nullable = false)
 |-- subcategory: string (nullable = false)
 |-- frequency: string (nullable = false)
 |-- trend_direction: string (nullable = false)
 |-- performance_rating: string (nullable = false)
 |-- data_quality_sc
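
The `value_formatted` and `trend_direction` columns follow simple rules that can be checked outside Spark. The sketch below mirrors them in plain Python under two assumptions: values are floats, and the list passed to `trend_directions` is already ordered by year for a single indicator and location. Both function names are hypothetical.

```python
def format_value(value):
    """Abbreviate large values with B/M/K suffixes, else round to 2 decimals.

    Note: Python's round() uses round-half-to-even while Spark's round() uses
    HALF_UP, so results can differ for values exactly on a rounding boundary.
    """
    if value > 1_000_000_000:
        return f"{round(value / 1_000_000_000, 2)}B"
    if value > 1_000_000:
        return f"{round(value / 1_000_000, 2)}M"
    if value > 1_000:
        return f"{round(value / 1_000, 2)}K"
    return str(round(value, 2))

def trend_directions(values):
    """Compare each value with its predecessor; the first one has none."""
    trends = []
    previous = None
    for value in values:
        if previous is None:
            trends.append("Unknown")
        elif value > previous:
            trends.append("Increasing")
        elif value < previous:
            trends.append("Decreasing")
        else:
            trends.append("Stable")
        previous = value
    return trends

# GDP growth for 2022, 2023, 2024 as shown in the sample data earlier.
print(trend_directions([7.6, 5.5, 5.7]))
print(format_value(1234.0))
```

The first year of every indicator is always "Unknown" because there is no earlier value to compare against.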

## Save Economic Indicators Fact Table

In [10]:
if final_economic_fact is not None and final_economic_fact.count() > 0:
    # Save economic indicators fact table
    fact_economic_path = os.path.join(SILVER_PATH, "fact_economic_indicators")
    
    try:
        final_economic_fact.write \
            .format("delta") \
            .mode("overwrite") \
            .option("overwriteSchema", "true") \
            .partitionBy("year", "category") \
            .save(fact_economic_path)
        
        print(f"\nEconomic indicators fact table saved successfully!")
        print(f"Path: {fact_economic_path}")
        print(f"Records: {final_economic_fact.count():,}")
        
    except Exception as e:
        print(f"Error saving economic indicators fact table: {e}")
        # Try saving as parquet if delta fails
        try:
            parquet_path = fact_economic_path + "_parquet"
            final_economic_fact.write.format("parquet").mode("overwrite").partitionBy("year", "category").save(parquet_path)
            print(f"Saved as parquet instead: {parquet_path}")
        except Exception as e2:
            print(f"Failed to save as parquet too: {e2}")
            raise
else:
    print("No economic indicators fact table to save")
    fact_economic_path = None

                                                                                


Economic indicators fact table saved successfully!
Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_economic_indicators
Records: 364
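
Writing with `partitionBy("year", "category")` lays the table out in Hive-style `column=value` directories by default. The sketch below shows the resulting directory names and counts distinct partitions; `partition_dir`, `count_partitions`, and the `/silver` base path are hypothetical.

```python
import posixpath

def partition_dir(base_path, year, category):
    """Directory a row with this year and category would be written under."""
    return posixpath.join(base_path, f"year={year}", f"category={category}")

def count_partitions(rows):
    """Number of distinct (year, category) combinations in the rows."""
    return len({(r["year"], r["category"]) for r in rows})

# Illustrative rows only.
rows = [
    {"year": 2024, "category": "Economic"},
    {"year": 2024, "category": "Environmental"},
    {"year": 2024, "category": "Economic"},
]
print(partition_dir("/silver/fact_economic_indicators", 2024, "Economic"))
print(count_partitions(rows))
```

With 45 years and 2 categories, the 364 records can spread over up to 90 partition directories, so each one holds only a handful of rows.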


## Data Quality Analysis

In [11]:
if final_economic_fact is not None and final_economic_fact.count() > 0:
    # Generate comprehensive data quality analysis
    print("Economic Indicators Fact Table - Data Quality Analysis")
    print("=" * 65)
    
    # Basic statistics
    total_records = final_economic_fact.count()
    print(f"Total Records: {total_records:,}")
    
    # Temporal coverage
    print("\nTemporal Coverage:")
    temporal_stats = final_economic_fact.agg(
        min("year").alias("min_year"),
        max("year").alias("max_year"),
        countDistinct("year").alias("unique_years")
    ).collect()[0]
    print(f"Year range: {temporal_stats.min_year} - {temporal_stats.max_year}")
    print(f"Unique years: {temporal_stats.unique_years}")
    
    # Category distribution
    print("\nCategory Distribution:")
    final_economic_fact.groupBy("category").count().orderBy(desc("count")).show()
    
    # Subcategory distribution
    print("\nSubcategory Distribution:")
    final_economic_fact.groupBy("subcategory").count().orderBy(desc("count")).show()
    
    # Indicator distribution
    print("\nIndicators by Record Count:")
    final_economic_fact.groupBy("indicator_name").count().orderBy(desc("count")).show(10, truncate=False)
    
else:
    print("No economic indicators fact table available for quality analysis")

Economic Indicators Fact Table - Data Quality Analysis
Total Records: 364

Temporal Coverage:
Year range: 1980 - 2024
Unique years: 45

Category Distribution:
+-------------+-----+
|     category|count|
+-------------+-----+
|     Economic|  253|
|Environmental|  111|
+-------------+-----+


Subcategory Distribution:
+-----------------+-----+
|      subcategory|count|
+-----------------+-----+
|              GDP|  135|
|        Emissions|  111|
|           Fiscal|   76|
|Natural Resources|   42|
+-----------------+-----+


Indicators by Record Count:
+----------------------------------------------+-----+
|indicator_name                                |count|
+----------------------------------------------+-----+
|CO2 emissions (metric tons)                   |88   |
|GDP Growth (annual %)                         |45   |
|GNI (current US$)                             |45   |
|GDP per capita (current US$)                  |45   |
|Total natural resources rents (% of GDP)      |42   |
|Ad
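
Record counts below 45 indicate indicators that do not cover the full 1980 to 2024 window. A minimal sketch for listing the missing years of one indicator follows; `missing_years` is a hypothetical helper, and the example assumes one record per year for tax revenue, which reported 34 records spanning 1990 to 2023.

```python
def missing_years(observed_years, start=1980, end=2024):
    """Years in [start, end] for which no observation exists."""
    observed = set(observed_years)
    return [year for year in range(start, end + 1) if year not in observed]

# Tax revenue: assumed to have exactly one record for each year 1990..2023.
tax_revenue_years = range(1990, 2024)
gaps = missing_years(tax_revenue_years)
print(len(gaps), gaps[0], gaps[-1])
```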

In [12]:
# Summary and cleanup
print(f"\n{'='*80}")
print("ECONOMIC INDICATORS FACT TABLE PROCESSING SUMMARY")
print(f"{'='*80}")
print(f"Processing completed: {PROCESSING_TIMESTAMP}")

if final_economic_fact is not None and final_economic_fact.count() > 0:
    record_count = final_economic_fact.count()
    unique_indicators = final_economic_fact.select("indicator_code").distinct().count()
    year_range = final_economic_fact.agg(min("year"), max("year")).collect()[0]
    
    print(f"Total fact records: {record_count:,}")
    print(f"Unique indicators: {unique_indicators}")
    print(f"Year coverage: {year_range[0]} - {year_range[1]}")
    print(f"Categories: Economic, Environmental")
    print(f"Subcategories: GDP, Fiscal, Emissions, Natural Resources")
    print(f"Data sources: World Bank M/G/P-series from new_bronze")
    print(f"Output path: {fact_economic_path}")
    print("Economic indicators fact table ready for analysis!")
else:
    print("No economic indicators data was processed")
    print("Possible reasons:")
    print("- World Bank data not found in new_bronze")
    print("- Data transformation errors")
    print("- Schema compatibility issues")
    print("- No valid data in source files")

# Stop Spark session
spark.stop()
print("\nSpark session stopped.")


ECONOMIC INDICATORS FACT TABLE PROCESSING SUMMARY
Processing completed: 2025-08-19 02:56:32.992723
Total fact records: 364
Unique indicators: 9
Year coverage: 1980 - 2024
Categories: Economic, Environmental
Subcategories: GDP, Fiscal, Emissions, Natural Resources
Data sources: World Bank M/G/P-series from new_bronze
Output path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_economic_indicators
Economic indicators fact table ready for analysis!

Spark session stopped.
