# Agricultural Land Fact Table Processor

Creates the agricultural land fact table for the Philippine socioeconomic data medallion architecture.
Processes World Bank agricultural and land use data from the new_bronze layer including:
- Agricultural land as percentage of total land area
- Agricultural irrigated land percentage
- Arable land percentage
- Cereal yield per hectare
- Forest area coverage
- Electric power consumption per capita

**Output**: fact_agricultural_land with comprehensive land use and agricultural metrics

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import json
from datetime import datetime, date
import re

In [2]:
# Initialize Spark Session with proper Delta Lake configuration
spark = SparkSession.builder \
    .appName("AgriculturalLandFactProcessor") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Application: {spark.sparkContext.appName}")
print(f"Delta Lake support: {spark.conf.get('spark.sql.extensions')}")

your 131072x1 screen size is bogus. expect trouble
25/08/19 02:50:07 WARN Utils: Your hostname, 3rnese resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/08/19 02:50:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/ernese/miniconda3/envs/SO/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ernese/.ivy2/cache
The jars for the packages stored in: /home/ernese/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cba8e40a-671f-43ed-8e54-9e111191a502;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 155ms :: artifacts dl 7ms
	:: modules in use:
	io.delta#delta-core_2.12;2.4.0 from central in [default]
	io.delta#delta-storage;2.4.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |

Spark Version: 3.4.0
Application: AgriculturalLandFactProcessor
Delta Lake support: io.delta.sql.DeltaSparkSessionExtension


In [3]:
# Configuration
BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze"
NEW_BRONZE_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze/new_bronze"
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
PROCESSING_TIMESTAMP = datetime.now()

os.makedirs(SILVER_PATH, exist_ok=True)

print(f"Bronze Path: {BRONZE_PATH}")
print(f"New Bronze Path: {NEW_BRONZE_PATH}")
print(f"Silver Path: {SILVER_PATH}")
print(f"Processing Time: {PROCESSING_TIMESTAMP}")

Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze
New Bronze Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-bronze/new_bronze
Silver Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Processing Time: 2025-08-19 02:50:12.502714


## Load Dimension Tables

In [4]:
# Load dimension tables for foreign key lookups
try:
    dim_location = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_location"))
    dim_time = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_time"))
    dim_indicator = spark.read.format("delta").load(os.path.join(SILVER_PATH, "dim_indicator"))
    
    print(f"Loaded dimension tables:")
    print(f"  - dim_location: {dim_location.count():,} records")
    print(f"  - dim_time: {dim_time.count():,} records")
    print(f"  - dim_indicator: {dim_indicator.count():,} records")
    
    # Get Philippines location_id
    philippines_location = dim_location.filter(col("location_name") == "Philippines").first()
    philippines_location_id = philippines_location.location_id if philippines_location else 1
    print(f"\nPhilippines location_id: {philippines_location_id}")
    
except Exception as e:
    print(f"Error loading dimension tables: {e}")
    raise

Loaded dimension tables:


                                                                                

  - dim_location: 34 records
  - dim_time: 612 records
  - dim_indicator: 15 records

Philippines location_id: 28


## Explore Agricultural Land Data Sources

In [5]:
# Check available L-series (Land/Agricultural) data in new_bronze
try:
    import glob
    
    # Find L-series directories in new_bronze
    new_bronze_dirs = glob.glob(os.path.join(NEW_BRONZE_PATH, "*"))
    l_series_dirs = [d for d in new_bronze_dirs if '/L' in d and any(x in os.path.basename(d) for x in ['L08', 'L09', 'L10', 'L11', 'L12', 'L13', 'L14'])]
    
    print(f"Available L-Series (Land/Agricultural) datasets in new_bronze ({len(l_series_dirs)} total):")
    
    for data_path in sorted(l_series_dirs):
        dataset_name = os.path.basename(data_path)
        print(f"  - {dataset_name}")
        
        # Try to peek at data structure
        try:
            sample_df = spark.read.parquet(data_path).limit(1)
            record_count = spark.read.parquet(data_path).count()
            columns = sample_df.columns
            print(f"    Records: {record_count}, Columns: {len(columns)} ({', '.join(columns)})") 
        except Exception as e:
            print(f"    Error reading: {e}")
    
    if not l_series_dirs:
        print("No L-series directories found. Checking all available datasets...")
        all_dirs = [d for d in new_bronze_dirs if os.path.isdir(d)]
        print(f"\nAll available datasets ({len(all_dirs)} total):")
        for d in sorted(all_dirs)[:15]:  # Show first 15
            print(f"  - {os.path.basename(d)}")
        if len(all_dirs) > 15:
            print(f"  ... and {len(all_dirs) - 15} more")
            
except Exception as e:
    print(f"Error exploring agricultural data: {e}")
    l_series_dirs = []

Available L-Series (Land/Agricultural) datasets in new_bronze (7 total):
  - L08_agricultural_land_pct_of_land_area_1980_2024
    Records: 45, Columns: 7 (indicator_id, indicator_value, country_id, country_value, country_iso3_code, date, value)
  - L09_agricultural_irrigated_land_pct_of_total_agricultural_land_1980_2024
    Records: 45, Columns: 7 (indicator_id, indicator_value, country_id, country_value, country_iso3_code, date, value)
  - L10_arable_land_pct_of_land_area_1980_2024
    Records: 45, Columns: 7 (indicator_id, indicator_value, country_id, country_value, country_iso3_code, date, value)
  - L11_cereal_yield_kg_per_hectare_1980_2024
    Records: 45, Columns: 7 (indicator_id, indicator_value, country_id, country_value, country_iso3_code, date, value)
  - L12_electric_power_consumption_kwh_per_capita_1980_2024
    Records: 45, Columns: 7 (indicator_id, indicator_value, country_id, country_value, country_iso3_code, date, value)
  - L13_forest_area_pct_of_land_area_1980_2024
  

## Load and Process Agricultural Land Data

In [6]:
def load_and_transform_agricultural_data(data_path):
    """Load and transform agricultural data from new_bronze Delta tables"""
    try:
        # Read the parquet data
        df = spark.read.parquet(data_path)
        
        dataset_name = os.path.basename(data_path)
        print(f"\nProcessing: {dataset_name}")
        print(f"Original shape: {df.count()} rows, {len(df.columns)} columns")
        
        # Extract series code from dataset name
        parts = dataset_name.split('_')
        series_code = parts[0] if parts else 'UNKNOWN'
        
        # Determine indicator name and unit from dataset name
        if 'agricultural_land_pct' in dataset_name:
            indicator_name = 'Agricultural Land (% of land area)'
            unit_measure = 'Percentage'
            land_use_type = 'Agricultural'
            measurement_type = 'Percentage Coverage'
        elif 'agricultural_irrigated_land_pct' in dataset_name:
            indicator_name = 'Agricultural Irrigated Land (% of total agricultural land)'
            unit_measure = 'Percentage'
            land_use_type = 'Irrigated'
            measurement_type = 'Percentage Coverage'
        elif 'arable_land_pct' in dataset_name:
            indicator_name = 'Arable Land (% of land area)'
            unit_measure = 'Percentage'
            land_use_type = 'Arable'
            measurement_type = 'Percentage Coverage'
        elif 'cereal_yield_kg' in dataset_name:
            indicator_name = 'Cereal Yield (kg per hectare)'
            unit_measure = 'kg/hectare'
            land_use_type = 'Agricultural'
            measurement_type = 'Productivity'
        elif 'electric_power_consumption_kwh' in dataset_name:
            indicator_name = 'Electric Power Consumption (kWh per capita)'
            unit_measure = 'kWh/capita'
            land_use_type = 'Other'
            measurement_type = 'Energy Consumption'
        elif 'forest_area_pct' in dataset_name:
            indicator_name = 'Forest Area (% of land area)'
            unit_measure = 'Percentage'
            land_use_type = 'Forest'
            measurement_type = 'Percentage Coverage'
        elif 'forest_area_sq_km' in dataset_name:
            indicator_name = 'Forest Area (sq. km)'
            unit_measure = 'Square kilometers'
            land_use_type = 'Forest'
            measurement_type = 'Area'
        else:
            # Use the indicator_value from the data if available
            first_row = df.first()
            if first_row and hasattr(first_row, 'indicator_value') and first_row.indicator_value:
                indicator_name = first_row.indicator_value
            else:
                indicator_name = f'Agricultural Indicator {series_code}'
            unit_measure = 'Number'
            land_use_type = 'Other'
            measurement_type = 'Other'
        
        # Transform data to standard format
        transformed_df = df.select(
            when(col("country_value").isNotNull(), col("country_value"))
            .otherwise(lit("Philippines")).alias("location_name"),
            col("date").cast(IntegerType()).alias("year"),
            col("value").cast(DoubleType()).alias("land_value"),
            lit(indicator_name).alias("indicator_name"),
            when(col("indicator_id").isNotNull(), col("indicator_id"))
            .otherwise(lit(series_code)).alias("indicator_code"),
            lit("World Bank").alias("data_source"),
            lit(dataset_name).alias("source_dataset"),
            lit(unit_measure).alias("unit_of_measure"),
            lit(land_use_type).alias("land_use_type"),
            lit(measurement_type).alias("measurement_type"),
            lit("Agricultural").alias("category"),
            lit("Land Use").alias("subcategory")
        ).filter(
            col("land_value").isNotNull() & 
            col("year").isNotNull() & 
            (col("year") >= 1980) & 
            (col("year") <= 2024)
        )
        
        record_count = transformed_df.count()
        print(f"  ✓ Processed: {record_count:,} records")
        
        if record_count > 0:
            print(f"  Sample data:")
            transformed_df.show(3, truncate=False)
        
        return transformed_df
        
    except Exception as e:
        print(f"  ✗ Error processing {data_path}: {e}")
        return None

# Process all L-series agricultural datasets
agricultural_datasets = []

print("Processing agricultural land datasets...")

for data_path in l_series_dirs:
    transformed_df = load_and_transform_agricultural_data(data_path)
    if transformed_df is not None and transformed_df.count() > 0:
        agricultural_datasets.append(transformed_df)

print(f"\nSuccessfully processed {len(agricultural_datasets)} agricultural datasets")

Processing agricultural land datasets...

Processing: L10_arable_land_pct_of_land_area_1980_2024
Original shape: 45 rows, 7 columns
  ✓ Processed: 43 records
  Sample data:
+-------------+----+----------+----------------------------+--------------+-----------+------------------------------------------+---------------+-------------+-------------------+------------+-----------+
|location_name|year|land_value|indicator_name              |indicator_code|data_source|source_dataset                            |unit_of_measure|land_use_type|measurement_type   |category    |subcategory|
+-------------+----+----------+----------------------------+--------------+-----------+------------------------------------------+---------------+-------------+-------------------+------------+-----------+
|Philippines  |2022|18.7      |Arable Land (% of land area)|AG.LND.ARBL.ZS|World Bank |L10_arable_land_pct_of_land_area_1980_2024|Percentage     |Arable       |Percentage Coverage|Agricultural|Land Use   |
|Ph

## Combine and Standardize Agricultural Data

In [7]:
if agricultural_datasets:
    print("Combining agricultural land datasets...")
    
    # Union all datasets
    combined_agricultural_df = agricultural_datasets[0]
    for df in agricultural_datasets[1:]:
        combined_agricultural_df = combined_agricultural_df.union(df)
    
    # Add additional standardization and quality measures
    final_agricultural_df = combined_agricultural_df \
        .withColumn("location_name", 
                    when(col("location_name").isNull(), "Philippines")
                    .otherwise(trim(col("location_name")))) \
        .withColumn("month", lit(12)) \
        .withColumn("quarter", lit(4)) \
        .withColumn("measurement_date", 
                    to_date(concat(col("year"), lit("-12-31")))) \
        .withColumn("frequency", lit("annual")) \
        .withColumn("data_quality_score", 
                    when(col("land_value").isNotNull() & (col("land_value") >= 0), 0.95)
                    .otherwise(0.7)) \
        .withColumn("created_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())) \
        .withColumn("updated_at", lit(PROCESSING_TIMESTAMP).cast(TimestampType())) \
        .filter(col("land_value").isNotNull()) \
        .filter(col("year") >= 1980) \
        .filter(col("year") <= 2024)
    
    total_records = final_agricultural_df.count()
    print(f"Combined agricultural dataset: {total_records:,} records")
    
    if total_records > 0:
        # Show data distribution
        print("\nData distribution by indicator:")
        final_agricultural_df.groupBy("indicator_name", "land_use_type").agg(
            count("*").alias("record_count"),
            min("year").alias("min_year"),
            max("year").alias("max_year"),
            avg("land_value").alias("avg_value"),
            min("land_value").alias("min_value"),
            max("land_value").alias("max_value")
        ).orderBy("land_use_type", "indicator_name").show(truncate=False)
        
        print("\nSample records:")
        final_agricultural_df.orderBy(desc("year"), "indicator_name").show(10, truncate=False)
    else:
        print("No valid agricultural data after filtering")
        final_agricultural_df = None
        
else:
    print("No agricultural datasets to combine")
    final_agricultural_df = None

Combining agricultural land datasets...
Combined agricultural dataset: 244 records

Data distribution by indicator:
+----------------------------------------------------------+-------------+------------+--------+--------+------------------+---------+---------+
|indicator_name                                            |land_use_type|record_count|min_year|max_year|avg_value         |min_value|max_value|
+----------------------------------------------------------+-------------+------------+--------+--------+------------------+---------+---------+
|Agricultural Land (% of land area)                        |Agricultural |43          |1980    |2022    |38.83488372093023 |35.6     |42.5     |
|Cereal Yield (kg per hectare)                             |Agricultural |43          |1980    |2022    |2.7674418604651163|2.0      |4.0      |
|Arable Land (% of land area)                              |Arable       |43          |1980    |2022    |17.930232558139537|16.6     |18.7     |
|Forest Area (

## Add Dimension Foreign Keys

In [8]:
if final_agricultural_df is not None:
    # Add location foreign keys
    agricultural_with_location = final_agricultural_df.join(
        dim_location.select("location_id", "location_name").alias("loc"),
        final_agricultural_df.location_name == col("loc.location_name"),
        "left"
    ).select(
        final_agricultural_df["*"],
        col("loc.location_id").alias("location_id")
    )
    
    # Use Philippines location_id as default
    agricultural_with_location = agricultural_with_location.withColumn(
        "location_id",
        when(col("location_id").isNull(), philippines_location_id)
        .otherwise(col("location_id"))
    )
    
    print("Added location foreign keys")
    agricultural_with_location.select("location_name", "location_id").distinct().show()
    
    # Add time foreign keys
    agricultural_with_time = agricultural_with_location.join(
        dim_time.select("date_id", "year", "month").alias("time"),
        (agricultural_with_location.year == col("time.year")) & 
        (agricultural_with_location.month == col("time.month")),
        "left"
    ).select(
        agricultural_with_location["*"],
        col("time.date_id").alias("date_id")
    )
    
    # Add default date_id for unmatched dates
    agricultural_with_time = agricultural_with_time.withColumn(
        "date_id",
        when(col("date_id").isNull(), lit(1)).otherwise(col("date_id"))
    )
    
    print("Added time foreign keys")
    print(f"Records with time keys: {agricultural_with_time.count():,}")
    
    # Add indicator foreign keys - use default for now
    agricultural_with_indicators = agricultural_with_time.withColumn("indicator_id", lit(1))
    
    print(f"Final records with all foreign keys: {agricultural_with_indicators.count():,}")
    
else:
    print("No agricultural data to process foreign keys")
    agricultural_with_indicators = None

Added location foreign keys
+-------------+-----------+
|location_name|location_id|
+-------------+-----------+
|  Philippines|         28|
+-------------+-----------+

Added time foreign keys
Records with time keys: 244
Final records with all foreign keys: 244


## Create Final Agricultural Land Fact Table

In [9]:
if agricultural_with_indicators is not None:
    # Create final fact table with proper structure
    final_agricultural_fact = agricultural_with_indicators.withColumn(
        "agricultural_land_id",
        row_number().over(Window.orderBy("location_id", "date_id", "indicator_code", "year"))
    ).withColumn(
        "value", col("land_value")
    ).withColumn(
        "value_formatted",
        when(col("unit_of_measure") == "Percentage", 
             concat(round(col("land_value"), 2).cast(StringType()), lit("%")))
        .when(col("unit_of_measure").contains("km"),
             concat(round(col("land_value"), 0).cast(StringType()), lit(" sq km")))
        .when(col("unit_of_measure").contains("kg"),
             concat(round(col("land_value"), 1).cast(StringType()), lit(" kg/ha")))
        .when(col("unit_of_measure").contains("kWh"),
             concat(round(col("land_value"), 1).cast(StringType()), lit(" kWh/capita")))
        .otherwise(round(col("land_value"), 2).cast(StringType()))
    ).withColumn(
        "confidence_interval_lower", lit(None).cast(DoubleType())
    ).withColumn(
        "confidence_interval_upper", lit(None).cast(DoubleType())
    ).withColumn(
        "trend_category",
        when(col("land_value") > lag(col("land_value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Increasing")
        .when(col("land_value") < lag(col("land_value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Decreasing")
        .when(col("land_value") == lag(col("land_value")).over(
            Window.partitionBy("indicator_code", "location_id").orderBy("year")), "Stable")
        .otherwise("Unknown")
    ).withColumn(
        "sustainability_rating",
        when((col("land_use_type") == "Forest") & (col("land_value") > 30), "Good")
        .when((col("land_use_type") == "Forest") & (col("land_value") > 20), "Fair")
        .when((col("land_use_type") == "Forest") & (col("land_value") <= 20), "Poor")
        .when((col("measurement_type") == "Productivity") & (col("land_value") > 3000), "High")
        .when((col("measurement_type") == "Productivity") & (col("land_value") > 2000), "Medium")
        .when((col("measurement_type") == "Productivity") & (col("land_value") <= 2000), "Low")
        .otherwise("Monitor")
    )
    
    # Select final columns for fact table
    final_agricultural_fact = final_agricultural_fact.select(
        "agricultural_land_id",
        "location_id",
        "date_id",
        "indicator_id",
        "measurement_date",
        "year",
        "month",
        "quarter",
        "indicator_code",
        "indicator_name",
        "value",
        "value_formatted",
        "confidence_interval_lower",
        "confidence_interval_upper",
        "unit_of_measure",
        "land_use_type",
        "measurement_type",
        "trend_category",
        "sustainability_rating",
        "category",
        "subcategory",
        "frequency",
        "data_quality_score",
        "data_source",
        "source_dataset",
        "created_at",
        "updated_at"
    )
    
    record_count = final_agricultural_fact.count()
    print(f"Final agricultural land fact table: {record_count:,} records")
    
    if record_count > 0:
        print("\nFinal schema:")
        final_agricultural_fact.printSchema()
        
        print("\nSample fact records:")
        final_agricultural_fact.orderBy(desc("year"), "indicator_name").show(5, truncate=False)
    else:
        print("No records in final fact table")
        final_agricultural_fact = None
        
else:
    print("No agricultural data available for fact table creation")
    final_agricultural_fact = None

Final agricultural land fact table: 244 records

Final schema:
root
 |-- agricultural_land_id: integer (nullable = false)
 |-- location_id: integer (nullable = true)
 |-- date_id: long (nullable = true)
 |-- indicator_id: integer (nullable = false)
 |-- measurement_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = false)
 |-- quarter: integer (nullable = false)
 |-- indicator_code: string (nullable = true)
 |-- indicator_name: string (nullable = false)
 |-- value: double (nullable = true)
 |-- value_formatted: string (nullable = true)
 |-- confidence_interval_lower: double (nullable = true)
 |-- confidence_interval_upper: double (nullable = true)
 |-- unit_of_measure: string (nullable = false)
 |-- land_use_type: string (nullable = false)
 |-- measurement_type: string (nullable = false)
 |-- trend_category: string (nullable = false)
 |-- sustainability_rating: string (nullable = false)
 |-- category: string (nullable = false)
 |-- subcate

## Save Agricultural Land Fact Table

In [10]:
if final_agricultural_fact is not None and final_agricultural_fact.count() > 0:
    # Save agricultural land fact table
    fact_agricultural_path = os.path.join(SILVER_PATH, "fact_agricultural_land")
    
    try:
        final_agricultural_fact.write \
            .format("delta") \
            .mode("overwrite") \
            .option("overwriteSchema", "true") \
            .partitionBy("year", "land_use_type") \
            .save(fact_agricultural_path)
        
        print(f"\nAgricultural land fact table saved successfully!")
        print(f"Path: {fact_agricultural_path}")
        print(f"Records: {final_agricultural_fact.count():,}")
        
    except Exception as e:
        print(f"Error saving agricultural land fact table: {e}")
        # Try saving as parquet if delta fails
        try:
            parquet_path = fact_agricultural_path + "_parquet"
            final_agricultural_fact.write.format("parquet").mode("overwrite").partitionBy("year", "land_use_type").save(parquet_path)
            print(f"Saved as parquet instead: {parquet_path}")
        except Exception as e2:
            print(f"Failed to save as parquet too: {e2}")
            raise
else:
    print("No agricultural land fact table to save")
    fact_agricultural_path = None

                                                                                


Agricultural land fact table saved successfully!
Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_agricultural_land
Records: 244


## Data Quality Analysis

In [11]:
if final_agricultural_fact is not None and final_agricultural_fact.count() > 0:
    # Generate comprehensive data quality analysis
    print("Agricultural Land Fact Table - Data Quality Analysis")
    print("=" * 65)
    
    # Basic statistics
    total_records = final_agricultural_fact.count()
    print(f"Total Records: {total_records:,}")
    
    # Temporal coverage
    print("\nTemporal Coverage:")
    temporal_stats = final_agricultural_fact.agg(
        min("year").alias("min_year"),
        max("year").alias("max_year"),
        countDistinct("year").alias("unique_years")
    ).collect()[0]
    print(f"Year range: {temporal_stats.min_year} - {temporal_stats.max_year}")
    print(f"Unique years: {temporal_stats.unique_years}")
    
    # Land use type distribution
    print("\nLand Use Type Distribution:")
    final_agricultural_fact.groupBy("land_use_type").count().orderBy(desc("count")).show()
    
    # Measurement type distribution
    print("\nMeasurement Type Distribution:")
    final_agricultural_fact.groupBy("measurement_type").count().orderBy(desc("count")).show()
    
    # Indicator distribution
    print("\nIndicators by Record Count:")
    final_agricultural_fact.groupBy("indicator_name", "unit_of_measure").count().orderBy(desc("count")).show(truncate=False)
    
    # Value statistics by land use type
    print("\nValue Statistics by Land Use Type:")
    value_stats = final_agricultural_fact.groupBy("land_use_type").agg(
        count("value").alias("record_count"),
        avg("value").alias("avg_value"),
        min("value").alias("min_value"),
        max("value").alias("max_value"),
        stddev("value").alias("stddev_value")
    ).orderBy("land_use_type")
    value_stats.show()
    
    # Trend analysis
    print("\nTrend Category Distribution:")
    final_agricultural_fact.groupBy("trend_category").count().orderBy(desc("count")).show()
    
    # Sustainability rating
    print("\nSustainability Rating Distribution:")
    final_agricultural_fact.groupBy("sustainability_rating").count().orderBy(desc("count")).show()
    
    # Data quality scores
    print("\nData Quality Assessment:")
    quality_stats = final_agricultural_fact.agg(
        avg("data_quality_score").alias("avg_quality_score"),
        sum(when(col("value").isNull(), 1).otherwise(0)).alias("null_values"),
        sum(when(col("value") < 0, 1).otherwise(0)).alias("negative_values"),
        countDistinct("indicator_code").alias("unique_indicators")
    ).collect()[0]
    
    print(f"Average quality score: {quality_stats.avg_quality_score:.3f}")
    print(f"Null values: {quality_stats.null_values:,}")
    print(f"Negative values: {quality_stats.negative_values:,}")
    print(f"Unique indicators: {quality_stats.unique_indicators}")
    
    # Recent trends (last 5 years)
    print("\nRecent Agricultural Trends (2020-2024):")
    recent_data = final_agricultural_fact.filter(col("year") >= 2020) \
        .groupBy("indicator_name", "year") \
        .agg(avg("value").alias("avg_value")) \
        .orderBy("indicator_name", "year")
    recent_data.show(truncate=False)
    
else:
    print("No agricultural land fact table available for quality analysis")

Agricultural Land Fact Table - Data Quality Analysis
Total Records: 244

Temporal Coverage:
Year range: 1980 - 2022
Unique years: 43

Land Use Type Distribution:
+-------------+-----+
|land_use_type|count|
+-------------+-----+
| Agricultural|   86|
|       Forest|   66|
|       Arable|   43|
|        Other|   43|
|    Irrigated|    6|
+-------------+-----+


Measurement Type Distribution:
+-------------------+-----+
|   measurement_type|count|
+-------------------+-----+
|Percentage Coverage|  125|
|       Productivity|   43|
| Energy Consumption|   43|
|               Area|   33|
+-------------------+-----+


Indicators by Record Count:
+----------------------------------------------------------+-----------------+-----+
|indicator_name                                            |unit_of_measure  |count|
+----------------------------------------------------------+-----------------+-----+
|Arable Land (% of land area)                              |Percentage       |43   |
|Cereal Yield

## Final Validation

In [12]:
if fact_agricultural_path is not None:
    # Final validation with dimension joins
    try:
        # Validate the saved table
        test_df = spark.read.format("delta").load(fact_agricultural_path)
        count = test_df.count()
        print(f"\nValidation: Successfully created fact_agricultural_land with {count:,} records")
        
        # Test a sample query joining with dimensions
        print("\nSample query with dimension joins:")
        sample_query = test_df.join(
            dim_location.select("location_id", "location_name"),
            "location_id"
        ).join(
            dim_time.select("date_id", "year", "month"),
            "date_id"
        ).select(
            "location_name", "year", "indicator_name", "value",
            "value_formatted", "unit_of_measure", "land_use_type", "trend_category"
        ).orderBy(desc("year"), "indicator_name").limit(10)
        
        sample_query.show(truncate=False)
        
        print("\nPartition validation:")
        partition_stats = test_df.groupBy("year", "land_use_type").count().orderBy("year", "land_use_type")
        partition_count = partition_stats.count()
        print(f"Total partitions created: {partition_count}")
        partition_stats.show(20)
        
    except Exception as e:
        print(f"Validation failed: {e}")
else:
    print("No agricultural land fact table to validate")


Validation: Successfully created fact_agricultural_land with 244 records

Sample query with dimension joins:
Validation failed: [AMBIGUOUS_REFERENCE] Reference `year` is ambiguous, could be: [`year`, `year`].


In [13]:
# Summary and cleanup
print(f"\n{'='*80}")
print("AGRICULTURAL LAND FACT TABLE PROCESSING SUMMARY")
print(f"{'='*80}")
print(f"Processing completed: {PROCESSING_TIMESTAMP}")

if final_agricultural_fact is not None and final_agricultural_fact.count() > 0:
    record_count = final_agricultural_fact.count()
    unique_indicators = final_agricultural_fact.select("indicator_code").distinct().count()
    year_range = final_agricultural_fact.agg(min("year"), max("year")).collect()[0]
    
    print(f"Total fact records: {record_count:,}")
    print(f"Unique indicators: {unique_indicators}")
    print(f"Year coverage: {year_range[0]} - {year_range[1]}")
    print(f"Land use types: Agricultural, Arable, Forest, Irrigated, Other")
    print(f"Measurement types: Percentage Coverage, Area, Productivity, Energy Consumption")
    print(f"Data sources: World Bank L-series from new_bronze")
    print(f"Output path: {fact_agricultural_path}")
    print("Agricultural land fact table ready for analysis!")
else:
    print("No agricultural land data was processed")
    print("Possible reasons:")
    print("- World Bank L-series data not found in new_bronze")
    print("- Data transformation errors")
    print("- Schema compatibility issues")
    print("- No valid data in source files")

# Stop Spark session
spark.stop()
print("\nSpark session stopped.")


AGRICULTURAL LAND FACT TABLE PROCESSING SUMMARY
Processing completed: 2025-08-19 02:50:12.502714
Total fact records: 244
Unique indicators: 7
Year coverage: 1980 - 2022
Land use types: Agricultural, Arable, Forest, Irrigated, Other
Measurement types: Percentage Coverage, Area, Productivity, Energy Consumption
Data sources: World Bank L-series from new_bronze
Output path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/fact_agricultural_land
Agricultural land fact table ready for analysis!

Spark session stopped.
