# Gold Layer: Agricultural Land Trends Dashboard

## Purpose
Transform fact_agricultural_land (1980-2022) into business-ready dashboard tables for land-use sustainability.

## Data Sources
- **Input**: final-spark-silver/fact_agricultural_land
- **Dimensions**: dim_time, dim_location, dim_indicator
- **Output**: gold_land_sustainability_summary, gold_land_trends_visualization

In [1]:
# First, install delta-spark if not already installed
import subprocess
import sys

try:
    import delta
    print("Delta Spark already installed")
except ImportError:
    print("Installing Delta Spark...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "delta-spark==2.4.0"])
    print("Delta Spark installation completed")

Delta Spark already installed


In [3]:
# Import required libraries
import os
from datetime import datetime


from pyspark.sql import SparkSession
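# NOTE: this wildcard import shadows Python built-ins such as min, max, abs, and round
# with their Spark column equivalents; later cells rely on the Spark versions.
from pyspark.sql.functions import *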
from pyspark.sql.types import *
from pyspark.sql.window import Window

print("All libraries imported successfully")

All libraries imported successfully


In [4]:
# Initialize Spark session with proper Delta Lake configuration
def create_spark_session():
    try:
        # First attempt: Full Delta Lake configuration
        spark = SparkSession.builder \
            .appName("GoldAgriculturalLandTrends") \
            .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
            .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
            .config("spark.databricks.delta.schema.autoMerge.enabled", "true") \
            .getOrCreate()
        
        # Confirm the delta Python package is importable (no table is read or written here)
        from delta.tables import DeltaTable
        print("Delta Lake successfully configured")
        return spark, True
        
    except Exception as e:
        print(f"Delta Lake configuration failed: {e}")
        print("Falling back to basic Spark session...")
        
        # Fallback: Basic Spark session
        spark = SparkSession.builder \
            .appName("GoldAgriculturalLandTrends") \
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
            .getOrCreate()
        
        return spark, False

# Create Spark session
spark, delta_available = create_spark_session()

# Set logging level
spark.sparkContext.setLogLevel("WARN")

print(f"Spark session initialized: {spark.version}")
print(f"Delta Lake available: {delta_available}")
print(f"Processing timestamp: {datetime.now()}")

25/08/19 12:54:57 WARN Utils: Your hostname, 3rnese resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/08/19 12:54:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/ernese/miniconda3/envs/SO/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ernese/.ivy2/cache
The jars for the packages stored in: /home/ernese/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0efa8f20-7fc3-4d74-aab0-d1e4c718738d;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 166ms :: artifacts dl 5ms
	:: modules in use:
	io.delta#delta-core_2.12;2.4.0 from central in [default]
	io.delta#delta-storage;2.4.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |

Delta Lake successfully configured
Spark session initialized: 3.4.0
Delta Lake available: True
Processing timestamp: 2025-08-19 12:55:00.067777


In [5]:
# Define paths
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
GOLD_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold"

# Create gold directory if it doesn't exist
os.makedirs(GOLD_PATH, exist_ok=True)

print(f"Silver layer path: {SILVER_PATH}")
print(f"Gold layer path: {GOLD_PATH}")
print(f"Delta Lake support: {delta_available}")

# Check if silver path exists
if os.path.exists(SILVER_PATH):
    print("Silver layer path exists")
    # List silver layer contents
    silver_contents = os.listdir(SILVER_PATH)
    print(f"Silver layer contents: {silver_contents}")
else:
    print("Silver layer path not found")

Silver layer path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Gold layer path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold
Delta Lake support: True
Silver layer path exists
Silver layer contents: ['dimensions_summary.json', 'dim_indicator_summary.json', 'dim_indicator', 'fact_electricity_pricing', 'gold-schema-draft.md', 'dim_location', 'bronze-overview.md', 'dim_time', 'fact_energy_consumption', 'fact_agricultural_land', 'fact_economic_indicators', 'dim_time_summary.json', 'silver-overview.md', 'fact_climate_weather']


## Step 1: Load Source Data with Format Detection

In [6]:
# Smart data loading function
def load_table(table_name, silver_path):
    table_path = f"{silver_path}/{table_name}"
    
    if not os.path.exists(table_path):
        raise FileNotFoundError(f"Table {table_name} not found at {table_path}")
    
    # Check if it's a Delta table (has _delta_log directory)
    delta_log_path = os.path.join(table_path, "_delta_log")
    
    if os.path.exists(delta_log_path) and delta_available:
        try:
            df = spark.read.format("delta").load(table_path)
            print(f"Loaded {table_name} as Delta format")
            return df
        except Exception as e:
            print(f"Failed to load {table_name} as Delta: {e}")
    
    # Fallback to Parquet
    try:
        df = spark.read.parquet(table_path)
        print(f"Loaded {table_name} as Parquet format")
        return df
    except Exception as e:
        print(f"Failed to load {table_name} as Parquet: {e}")
        raise

# Load silver layer tables
try:
    fact_agricultural_land = load_table("fact_agricultural_land", SILVER_PATH)
    dim_time = load_table("dim_time", SILVER_PATH)
    dim_location = load_table("dim_location", SILVER_PATH)
    dim_indicator = load_table("dim_indicator", SILVER_PATH)
    
    print("\nSuccessfully loaded all silver layer tables")
    
except Exception as e:
    print(f"Error loading silver tables: {str(e)}")
    raise

Loaded fact_agricultural_land as Delta format
Loaded dim_time as Delta format
Loaded dim_location as Delta format
Loaded dim_indicator as Delta format

Successfully loaded all silver layer tables
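
The format check above relies on one convention: a Delta table directory contains a `_delta_log` subdirectory, while a plain Parquet directory does not. Below is a minimal standalone sketch of that detection step, using only the standard library. The helper name `detect_table_format` and the directory names are hypothetical and not part of the notebook.

```python
import os
import tempfile


def detect_table_format(table_path: str) -> str:
    """Return "delta" if the directory has a _delta_log folder, else "parquet".

    Hypothetical helper for illustration; it only inspects the directory
    layout and does not open any data files.
    """
    if not os.path.isdir(table_path):
        raise FileNotFoundError(f"Table directory not found: {table_path}")
    if os.path.isdir(os.path.join(table_path, "_delta_log")):
        return "delta"
    return "parquet"


# Build two throwaway directories to exercise both branches.
with tempfile.TemporaryDirectory() as root:
    delta_dir = os.path.join(root, "dim_time")
    parquet_dir = os.path.join(root, "dim_other")
    os.makedirs(os.path.join(delta_dir, "_delta_log"))
    os.makedirs(parquet_dir)
    detected = {
        "dim_time": detect_table_format(delta_dir),
        "dim_other": detect_table_format(parquet_dir),
    }

print(detected)
```

Keeping the detection separate from the read makes it possible to log or assert the expected format before any Spark job starts.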


In [7]:
# Display data validation
print("Data Validation:")
print(f"fact_agricultural_land records: {fact_agricultural_land.count():,}")
print(f"dim_time records: {dim_time.count():,}")
print(f"dim_location records: {dim_location.count():,}")
print(f"dim_indicator records: {dim_indicator.count():,}")

# Show schemas
print("\nfact_agricultural_land Schema:")
fact_agricultural_land.printSchema()

print("\nDate Range in Agricultural Land Data:")
date_range = fact_agricultural_land.select(
    min("year").alias("min_year"),
    max("year").alias("max_year"),
    countDistinct("year").alias("total_years")
).collect()[0]

print(f"Years covered: {date_range['min_year']} - {date_range['max_year']} ({date_range['total_years']} years)")

print("\nLand Use Types:")
land_types = fact_agricultural_land.select("land_use_type").distinct().orderBy("land_use_type")
land_types.show()

print("\nSample Data:")
fact_agricultural_land.show(5, truncate=False)

Data Validation:


25/08/19 12:55:17 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

fact_agricultural_land records: 244
dim_time records: 612
dim_location records: 34
dim_indicator records: 15

fact_agricultural_land Schema:
root
 |-- agricultural_land_id: integer (nullable = true)
 |-- location_id: integer (nullable = true)
 |-- date_id: long (nullable = true)
 |-- indicator_id: integer (nullable = true)
 |-- measurement_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- quarter: integer (nullable = true)
 |-- indicator_code: string (nullable = true)
 |-- indicator_name: string (nullable = true)
 |-- value: double (nullable = true)
 |-- value_formatted: string (nullable = true)
 |-- confidence_interval_lower: double (nullable = true)
 |-- confidence_interval_upper: double (nullable = true)
 |-- unit_of_measure: string (nullable = true)
 |-- land_use_type: string (nullable = true)
 |-- measurement_type: string (nullable = true)
 |-- trend_category: string (nullable = true)
 |-- sustainability_rating: string (nu

                                                                                

Years covered: 1980 - 2022 (43 years)

Land Use Types:
+-------------+
|land_use_type|
+-------------+
| Agricultural|
|       Arable|
|       Forest|
|    Irrigated|
|        Other|
+-------------+


Sample Data:
+--------------------+-----------+-------+------------+----------------+----+-----+-------+-----------------+-------------------------------------------+-----+----------------+-------------------------+-------------------------+-----------------+-------------+-------------------+--------------+---------------------+------------+-----------+---------+------------------+-----------+-------------------------------------------------------+--------------------------+--------------------------+
|agricultural_land_id|location_id|date_id|indicator_id|measurement_date|year|month|quarter|indicator_code   |indicator_name                             |value|value_formatted |confidence_interval_lower|confidence_interval_upper|unit_of_measure  |land_use_type|measurement_type   |trend_catego

## Step 2: Data Preparation and Joining

In [8]:
# Check dimension table schemas
print("Checking dimension table schemas:")

print("\ndim_time columns:")
print(dim_time.columns)

print("\ndim_location columns:")
print(dim_location.columns)

print("\ndim_indicator columns:")
print(dim_indicator.columns)

print("\nfact_agricultural_land columns:")
print(fact_agricultural_land.columns)

Checking dimension table schemas:

dim_time columns:
['created_at', 'date_id', 'date_value', 'fiscal_month', 'fiscal_quarter', 'fiscal_quarter_name', 'fiscal_year', 'is_current_month', 'is_leap_year', 'month', 'month_name', 'quarter', 'quarter_name', 'season', 'week_of_year', 'year', 'year_month', 'fiscal_year_quarter', 'calendar_year_quarter', 'is_fiscal_year_start', 'is_fiscal_year_end', 'days_in_month']

dim_location columns:
['location_id', 'location_code', 'location_name', 'location_type', 'parent_location_id', 'iso_code', 'region_code', 'province_code', 'latitude', 'longitude', 'is_active', 'valid_from', 'valid_to', 'created_at', 'updated_at']

dim_indicator columns:
['indicator_id', 'indicator_code', 'indicator_name', 'indicator_description', 'unit_of_measure', 'data_source', 'category', 'subcategory', 'methodology', 'frequency', 'is_active', 'created_at', 'updated_at']

fact_agricultural_land columns:
['agricultural_land_id', 'location_id', 'date_id', 'indicator_id', 'measureme

In [9]:
# Create base dataset for analysis
base_land_data = fact_agricultural_land.withColumn(
    "quarter", 
    when(col("year").isNotNull(), 
         when(col("year") % 4 == 0, 1)
         .when(col("year") % 4 == 1, 2)
         .when(col("year") % 4 == 2, 3)
         .otherwise(4)
    ).otherwise(1)
).withColumn(
    "fiscal_year",
    col("year")
)

# Join with location dimension if available
if dim_location.count() > 0:
    print("Joining with location dimension...")
    
    # Check if region_code exists, fallback to other location fields
    location_cols = dim_location.columns
    
    if "region_code" in location_cols:
        # Join on the column name so the result keeps a single location_id column
        land_with_location = base_land_data.join(
            dim_location.select("location_id", "location_name", "location_type", "region_code"),
            on="location_id",
            how="left"
        ).withColumnRenamed("region_code", "region")
    elif "region" in location_cols:
        land_with_location = base_land_data.join(
            dim_location.select("location_id", "location_name", "location_type", "region"),
            on="location_id",
            how="left"
        )
    else:
        land_with_location = base_land_data.join(
            dim_location.select("location_id", "location_name", "location_type"),
            on="location_id",
            how="left"
        ).withColumn("region", lit("Philippines"))
else:
    print("No location data available, using default location info")
    land_with_location = base_land_data.withColumn(
        "location_name", lit("Philippines")
    ).withColumn(
        "location_type", lit("National")
    ).withColumn(
        "region", lit("National")
    )

print(f"Base dataset prepared with {land_with_location.count():,} records")

Joining with location dimension...
Base dataset prepared with 244 records
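
Note that the `quarter` column built in the cell above comes from `year % 4`, so it cycles through 1 to 4 across successive years rather than reflecting a calendar quarter. If a calendar quarter is what the dashboard needs, one possible approach is to derive it from the month of `measurement_date`. The sketch below shows the arithmetic in plain Python. The helper names are hypothetical, and the label shape mirrors the `chart_period` column built in Step 5.

```python
from datetime import date


def calendar_quarter(d: date) -> int:
    """Map a date to its calendar quarter (1-4) from the month number."""
    return (d.month - 1) // 3 + 1


def chart_period(d: date) -> str:
    """Build a "YYYY-Qn" label from a date (illustrative helper)."""
    return f"{d.year}-Q{calendar_quarter(d)}"


samples = [date(2018, 1, 1), date(2018, 6, 30), date(2019, 12, 31)]
labels = [chart_period(d) for d in samples]
print(labels)
```

In Spark, the corresponding expression would likely be `quarter(col("measurement_date"))` from `pyspark.sql.functions`. Whether that matches the intended semantics for annual data is an open question for the pipeline owner.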


## Step 3: Calculate Trend Metrics

In [10]:
# Calculate rolling averages and trend indicators
window_5yr = Window.partitionBy("land_use_type").orderBy("year").rowsBetween(-4, 0)
window_prev_year = Window.partitionBy("land_use_type").orderBy("year")
window_decade = Window.partitionBy("land_use_type").orderBy("year").rowsBetween(-9, 0)

land_with_trends = land_with_location.withColumn(
    "rolling_5yr_avg", avg("value").over(window_5yr)
).withColumn(
    "prev_year_value", lag("value", 1).over(window_prev_year)
).withColumn(
    "rolling_10yr_avg", avg("value").over(window_decade)
).withColumn(
    "year_over_year_change", col("value") - col("prev_year_value")
).withColumn(
    "year_over_year_pct_change", 
    when(col("prev_year_value") > 0, 
         (col("value") - col("prev_year_value")) / col("prev_year_value") * 100
    ).otherwise(0.0)
)

print("Calculated trend metrics and rolling averages")

Calculated trend metrics and rolling averages
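
The windows above partition by `land_use_type` alone. If several series (for example, different indicators or locations) share one land use type, `lag` compares a row with a neighbouring series rather than with the previous year of the same series. The sample output in the next cell shows changes such as 957% and -90% for the same year and land use type, which is consistent with that kind of interleaving. The sketch below reproduces the effect in plain Python. The `series` field and all values are made-up assumptions chosen so the arithmetic is easy to follow.

```python
from itertools import groupby


def yoy_pct(rows, key):
    """Percent change versus the previous row within each partition.

    rows: list of dicts with "year" and "value" plus the partition fields.
    key:  function returning the partition key for a row.
    The first row of each partition gets 0.0, mirroring otherwise(0.0).
    """
    out = []
    ordered = sorted(rows, key=lambda r: (key(r), r["year"]))
    for _, part in groupby(ordered, key=key):
        prev = None
        for r in part:
            pct = (r["value"] - prev) / prev * 100 if prev and prev > 0 else 0.0
            out.append({**r, "yoy_pct": round(pct, 2)})
            prev = r["value"]
    return out


# Illustrative data: two series that share one land use type.
rows = [
    {"land_use_type": "Agricultural", "series": "A", "year": 2017, "value": 4.0},
    {"land_use_type": "Agricultural", "series": "B", "year": 2017, "value": 40.0},
    {"land_use_type": "Agricultural", "series": "A", "year": 2018, "value": 4.0},
    {"land_use_type": "Agricultural", "series": "B", "year": 2018, "value": 42.0},
]

# Partition by land use type only: the two series interleave.
mixed = [r["yoy_pct"] for r in yoy_pct(rows, key=lambda r: r["land_use_type"])]
# Partition by land use type and series: each series is compared with itself.
per_series = [
    r["yoy_pct"]
    for r in yoy_pct(rows, key=lambda r: (r["land_use_type"], r["series"]))
]
print(mixed)
print(per_series)
```

In Spark terms, this would correspond to adding the series-identifying columns to `partitionBy`, perhaps `location_id` and `indicator_code` from the fact schema. Which columns actually identify a series depends on the data and should be checked first.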


In [11]:
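# Create trend classifications and chart formatting
# Note: trend_5yr_direction is derived from the year-over-year percentage change, not from a 5-year window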
land_with_classification = land_with_trends.withColumn(
    "trend_5yr_direction",
    when(col("year_over_year_pct_change") > 0.5, "Increasing")
    .when(col("year_over_year_pct_change") < -0.5, "Decreasing")
    .otherwise("Stable")
).withColumn(
    "sustainability_category",
    when(col("land_use_type") == "Forest", 
         when(col("year_over_year_pct_change") > 0, "Improving")
         .when(col("year_over_year_pct_change") < -1, "Declining")
         .otherwise("Stable")
    ).when(col("land_use_type") == "Agricultural",
         when(col("year_over_year_pct_change") > 2, "Expanding")
         .when(col("year_over_year_pct_change") < -2, "Contracting")
         .otherwise("Stable")
    ).otherwise("Neutral")
).withColumn(
    "chart_color_hex",
    when(col("land_use_type") == "Forest", "#228B22")
    .when(col("land_use_type") == "Agricultural", "#8B4513")
    .when(col("land_use_type") == "Arable", "#DAA520")
    .when(col("land_use_type") == "Irrigated", "#4682B4")
    .otherwise("#808080")
)

print("Added trend classifications and chart formatting")

# Show sample of processed data
print("\nSample Processed Data:")
land_with_classification.filter(
    (col("year") >= 2018) & (col("land_use_type").isin(["Forest", "Agricultural"]))
).select(
    "year", "land_use_type", "value", "rolling_5yr_avg", 
    "year_over_year_pct_change", "trend_5yr_direction", "sustainability_category"
).orderBy("year", "land_use_type").show(10)

Added trend classifications and chart formatting

Sample Processed Data:
+----+-------------+-----+------------------+-------------------------+-------------------+-----------------------+
|year|land_use_type|value|   rolling_5yr_avg|year_over_year_pct_change|trend_5yr_direction|sustainability_category|
+----+-------------+-----+------------------+-------------------------+-------------------+-----------------------+
|2018| Agricultural| 42.3|26.920000000000005|        957.4999999999999|         Increasing|              Expanding|
|2018| Agricultural|  4.0|              19.3|       -90.54373522458629|         Decreasing|            Contracting|
|2018|       Forest| 71.2|             51.98|       199.15966386554624|         Increasing|              Improving|
|2018|       Forest| 23.9|42.660000000000004|       -66.43258426966293|         Decreasing|              Declining|
|2019| Agricultural| 42.4|             26.98|                    960.0|         Increasing|              Expanding|
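
The nested `when`/`otherwise` chain for `sustainability_category` is easier to check when written as ordinary branching. The following plain-Python mirror of the thresholds is a sketch for reasoning and unit testing, not a replacement for the Spark expression. The function name is reused here purely for illustration.

```python
def sustainability_category(land_use_type: str, yoy_pct: float) -> str:
    """Plain-Python mirror of the nested when/otherwise thresholds above."""
    if land_use_type == "Forest":
        if yoy_pct > 0:
            return "Improving"
        if yoy_pct < -1:
            return "Declining"
        return "Stable"
    if land_use_type == "Agricultural":
        if yoy_pct > 2:
            return "Expanding"
        if yoy_pct < -2:
            return "Contracting"
        return "Stable"
    # Arable, Irrigated, Other, and anything unexpected fall through here.
    return "Neutral"


cases = [
    ("Forest", 0.3),
    ("Forest", -0.5),
    ("Forest", -1.5),
    ("Agricultural", 1.0),
    ("Agricultural", -2.5),
    ("Arable", 5.0),
]
categories = [sustainability_category(t, p) for t, p in cases]
print(categories)
```

One detail the branching makes visible: a Forest change between -1 and 0 is labelled "Stable", so small declines are not flagged.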

## Step 4: Create Gold Table 1 - Land Sustainability Summary

In [12]:
# Create annual sustainability summary
sustainability_summary = land_with_classification.groupBy(
    "year", "fiscal_year"
).agg(
    # Land use percentages
    avg(when(col("land_use_type") == "Agricultural", col("value"))).alias("agricultural_land_pct"),
    avg(when(col("land_use_type") == "Forest", col("value"))).alias("forest_coverage_pct"),
    avg(when(col("land_use_type") == "Arable", col("value"))).alias("arable_land_pct"),
    avg(when(col("land_use_type") == "Irrigated", col("value"))).alias("irrigated_land_pct"),
    
    # Trend indicators
    avg(when(col("land_use_type") == "Forest", col("year_over_year_pct_change"))).alias("forest_annual_change_pct"),
    avg(when(col("land_use_type") == "Agricultural", col("year_over_year_pct_change"))).alias("agricultural_annual_change_pct"),
    
    # Rolling averages
    avg(when(col("land_use_type") == "Forest", col("rolling_5yr_avg"))).alias("forest_5yr_avg_pct"),
    avg(when(col("land_use_type") == "Agricultural", col("rolling_5yr_avg"))).alias("agricultural_5yr_avg_pct")
).withColumn(
    # monotonically_increasing_id() yields unique but not consecutive IDs
    "land_summary_id", monotonically_increasing_id()
).withColumn(
    "deforestation_rate_annual", 
    when(col("forest_annual_change_pct") < 0, abs(col("forest_annual_change_pct"))).otherwise(0.0)
).withColumn(
    "land_sustainability_score",
    (coalesce(col("forest_coverage_pct"), lit(0)) * 0.4 +
     (100 - abs(coalesce(col("agricultural_annual_change_pct"), lit(0)))) * 0.3 +
     coalesce(col("arable_land_pct"), lit(0)) * 0.3)
).withColumn(
    "sustainability_rating",
    when(col("land_sustainability_score") >= 80, "Excellent")
    .when(col("land_sustainability_score") >= 65, "Good")
    .when(col("land_sustainability_score") >= 50, "Fair")
    .otherwise("Poor")
).withColumn(
    "forest_trend_5yr",
    when(coalesce(col("forest_annual_change_pct"), lit(0)) > 0.5, "Increasing")
    .when(coalesce(col("forest_annual_change_pct"), lit(0)) < -0.5, "Decreasing")
    .otherwise("Stable")
).withColumn(
    "agricultural_trend_5yr",
    when(coalesce(col("agricultural_annual_change_pct"), lit(0)) > 1.0, "Increasing")
    .when(coalesce(col("agricultural_annual_change_pct"), lit(0)) < -1.0, "Decreasing")
    .otherwise("Stable")
).withColumn(
    "processing_timestamp", current_timestamp()
).withColumn(
    "data_source", lit("World Bank L-series via Silver Layer")
).orderBy("year")

print(f"Created sustainability summary with {sustainability_summary.count():,} records")

# Show sample data
print("\nSample Sustainability Summary (Recent Years):")
sustainability_summary.filter(col("year") >= 2015).select(
    "year", "agricultural_land_pct", "forest_coverage_pct", "arable_land_pct",
    "land_sustainability_score", "sustainability_rating", "forest_trend_5yr"
).show(10)

Created sustainability summary with 43 records

Sample Sustainability Summary (Recent Years):
+----+---------------------+-------------------+---------------+-------------------------+---------------------+----------------+
|year|agricultural_land_pct|forest_coverage_pct|arable_land_pct|land_sustainability_score|sustainability_rating|forest_trend_5yr|
+----+---------------------+-------------------+---------------+-------------------------+---------------------+----------------+
|2015|                 23.0|               46.8|           18.7|       -74.59857142857143|                 Poor|      Increasing|
|2016|                23.05|              47.05|           18.7|       -74.87017814726839|                 Poor|      Increasing|
|2017|                 23.1|               47.3|           18.7|       -75.14180094786731|                 Poor|      Increasing|
|2018|                23.15|              47.55|           18.7|       -75.41343971631204|                 Poor|      Increasi
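
The negative scores follow directly from the formula. The weights sum to 1.0, but the middle term `(100 - abs(change)) * 0.3` goes negative whenever the averaged agricultural change exceeds 100%. Working backwards from the 2015 row (score about -74.6 with forest 46.8 and arable 18.7), the averaged change feeding the formula would have to be roughly 430%. The sketch below reproduces the arithmetic in plain Python and shows one possible guard. The `clamped_score` variant is an assumption for illustration, not the notebook's formula.

```python
def land_sustainability_score(forest_pct, agri_change_pct, arable_pct):
    """Weighted score as defined above; None inputs count as 0 (like coalesce)."""
    forest = forest_pct or 0.0
    change = agri_change_pct or 0.0
    arable = arable_pct or 0.0
    return forest * 0.4 + (100 - abs(change)) * 0.3 + arable * 0.3


def clamped_score(forest_pct, agri_change_pct, arable_pct):
    """Variant (an assumption): cap the change penalty at 100 so the
    middle term stays within the 0-30 range."""
    change = min(abs(agri_change_pct or 0.0), 100.0)
    return land_sustainability_score(forest_pct, change, arable_pct)


stable = land_sustainability_score(46.8, 0.0, 18.7)
volatile = land_sustainability_score(46.8, 430.0, 18.7)
capped = clamped_score(46.8, 430.0, 18.7)
print(round(stable, 2), round(volatile, 2), round(capped, 2))
```

Even with zero change, forest at 46.8 and arable at 18.7 give about 54.3, so the "Good" and "Excellent" thresholds may be out of reach at these coverage levels regardless of the change term.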

## Step 5: Create Gold Table 2 - Land Trends Visualization

In [13]:
# Create chart-optimized land trends table
land_trends_viz = land_with_classification.select(
    monotonically_increasing_id().alias("land_trend_id"),
    col("year"),
    col("quarter"),
    col("fiscal_year"),
    concat(col("year"), lit("-Q"), col("quarter")).alias("chart_period"),
    col("land_use_type"),
    col("value").alias("land_percentage"),
    col("rolling_5yr_avg").alias("trend_line_value"),
    col("year_over_year_change"),
    col("year_over_year_pct_change"),
    col("trend_5yr_direction"),
    col("sustainability_category"),
    col("chart_color_hex"),
    
    # Chart formatting
    when(col("land_use_type") == "Forest", 
         concat(lit("Forest: "), round(col("value"), 1), lit("%"))
    ).when(col("land_use_type") == "Agricultural",
         concat(lit("Agricultural: "), round(col("value"), 1), lit("%"))
    ).when(col("land_use_type") == "Arable",
         concat(lit("Arable: "), round(col("value"), 1), lit("%"))
    ).otherwise(
         concat(col("land_use_type"), lit(": "), round(col("value"), 1), lit("%"))
    ).alias("chart_label"),
    
    # Tooltip text
    concat(
        col("year"), lit(": "),
        col("land_use_type"), lit(" land coverage is "),
        round(col("value"), 1), lit("%"),
        when(coalesce(col("year_over_year_pct_change"), lit(0)) > 0, 
             concat(lit(" (up "), round(col("year_over_year_pct_change"), 1), lit("%)"))
        ).when(coalesce(col("year_over_year_pct_change"), lit(0)) < 0,
             concat(lit(" (down "), round(abs(col("year_over_year_pct_change")), 1), lit("%)"))
        ).otherwise(lit(" (stable)"))
    ).alias("tooltip_text"),
    
    # Trend arrows
    when(coalesce(col("year_over_year_pct_change"), lit(0)) > 0.5, "up")
    .when(coalesce(col("year_over_year_pct_change"), lit(0)) < -0.5, "down")
    .otherwise("stable").alias("trend_arrow"),
    
    col("location_name"),
    current_timestamp().alias("last_updated")
).orderBy("year", "land_use_type")

print(f"Created land trends visualization table with {land_trends_viz.count():,} records")

# Show sample visualization data
print("\nSample Land Trends Visualization (2018-2022):")
land_trends_viz.filter(
    (col("year") >= 2018) & (col("land_use_type").isin(["Forest", "Agricultural", "Arable"]))
).select(
    "year", "land_use_type", "land_percentage", "trend_line_value", 
    "year_over_year_pct_change", "trend_arrow", "chart_color_hex"
).show(15)

Created land trends visualization table with 244 records

Sample Land Trends Visualization (2018-2022):
+----+-------------+---------------+------------------+-------------------------+-----------+---------------+
|year|land_use_type|land_percentage|  trend_line_value|year_over_year_pct_change|trend_arrow|chart_color_hex|
+----+-------------+---------------+------------------+-------------------------+-----------+---------------+
|2018| Agricultural|           42.3|26.920000000000005|        957.4999999999999|         up|        #8B4513|
|2018| Agricultural|            4.0|              19.3|       -90.54373522458629|       down|        #8B4513|
|2018|       Arable|           18.7|              18.7|                      0.0|     stable|        #DAA520|
|2018|       Forest|           71.2|             51.98|       199.15966386554624|         up|        #228B22|
|2018|       Forest|           23.9|42.660000000000004|       -66.43258426966293|       down|        #228B22|
|2019| Agricultu
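
The tooltip string is assembled from several pieces, so a plain-Python version is handy for previewing the text a dashboard user would see. The sketch below follows the same order as the `concat()` call above. The function name is hypothetical. Python's `round` can differ from Spark's on exact .5 ties, and unlike Spark's `concat` this version does not return NULL when an input is missing.

```python
def tooltip_text(year: int, land_use_type: str, value: float, yoy_pct: float) -> str:
    """Assemble the tooltip string in the same order as the concat() above."""
    base = f"{year}: {land_use_type} land coverage is {round(value, 1)}%"
    if yoy_pct > 0:
        return f"{base} (up {round(yoy_pct, 1)}%)"
    if yoy_pct < 0:
        return f"{base} (down {round(abs(yoy_pct), 1)}%)"
    return f"{base} (stable)"


# Inputs taken from the 2018 sample rows shown above.
tips = [
    tooltip_text(2018, "Arable", 18.7, 0.0),
    tooltip_text(2018, "Forest", 23.9, -66.43258426966293),
    tooltip_text(2018, "Forest", 71.2, 199.15966386554624),
]
for tip in tips:
    print(tip)
```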

## Step 6: Data Quality Validation

In [14]:
# Data quality validation
print("Data Quality Validation:")

# Get year range for reporting
year_stats = sustainability_summary.agg(
    min("year").alias("min_year"),
    max("year").alias("max_year"),
    count("year").alias("total_records")
).collect()[0]

min_year = year_stats['min_year']
max_year = year_stats['max_year']
total_records = year_stats['total_records']

print(f"Year range: {min_year} - {max_year} ({total_records} records)")

# Check for nulls in critical fields
critical_fields = ["year", "agricultural_land_pct", "forest_coverage_pct", "land_sustainability_score"]
null_checks = {}

for field in critical_fields:
    if field in sustainability_summary.columns:
        null_count = sustainability_summary.filter(col(field).isNull()).count()
        null_checks[field] = null_count
    else:
        null_checks[field] = "Field not found"

print("\nNull values in sustainability summary:")
for field, null_count in null_checks.items():
    print(f"  {field}: {null_count}")

# Sustainability score distribution
print("\nSustainability Score Distribution:")
if "sustainability_rating" in sustainability_summary.columns:
    sustainability_summary.groupBy("sustainability_rating").count().orderBy("sustainability_rating").show()
else:
    print("Sustainability rating field not found")

# Recent trends summary
print("\nRecent Trends (2015-2022):")
recent_data = sustainability_summary.filter(col("year") >= 2015)

if recent_data.count() > 0:
    recent_trends = recent_data.agg(
        avg("forest_coverage_pct").alias("avg_forest_pct"),
        avg("agricultural_land_pct").alias("avg_agricultural_pct"),
        avg("land_sustainability_score").alias("avg_sustainability_score")
    ).collect()[0]
    
    for metric, value in recent_trends.asDict().items():
        if value is not None:
            print(f"  {metric}: {value:.2f}")
        else:
            print(f"  {metric}: No data")
else:
    print("No recent data available")

print("\nData quality validation completed")

Data Quality Validation:
Year range: 1980 - 2022 (43 records)

Null values in sustainability summary:
  year: 0
  agricultural_land_pct: 0
  forest_coverage_pct: 10
  land_sustainability_score: 0

Sustainability Score Distribution:
+---------------------+-----+
|sustainability_rating|count|
+---------------------+-----+
|                 Poor|   43|
+---------------------+-----+


Recent Trends (2015-2022):
  avg_forest_pct: 47.64
  avg_agricultural_pct: 23.16
  avg_sustainability_score: -75.42

Data quality validation completed
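
The validation above reports the minimum year, the maximum year, and a record count. 43 records across 1980-2022 is consistent with a complete range, but a count alone would not reveal a gap that is offset by a duplicate. A direct check for missing years could look like the sketch below. The helper name `missing_years` and the gap years are hypothetical.

```python
def missing_years(years, start=None, end=None):
    """Return the years absent from an inclusive range, sorted ascending."""
    present = set(years)
    if not present:
        return []
    lo = min(present) if start is None else start
    hi = max(present) if end is None else end
    return [y for y in range(lo, hi + 1) if y not in present]


full = list(range(1980, 2023))
gappy = [y for y in full if y not in (1991, 2005)]
print(missing_years(full), missing_years(gappy))
```

With a Spark DataFrame, the input list could come from collecting the distinct `year` values, which is cheap at this data size.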


## Step 7: Save Gold Tables as Delta

In [15]:
# Smart save function that handles Delta and Parquet fallback
def save_gold_table(dataframe, table_name, gold_path, delta_available):
    table_path = f"{gold_path}/{table_name}"
    
    if delta_available:
        try:
            # Save as Delta table
            dataframe.write \
                .format("delta") \
                .mode("overwrite") \
                .save(table_path)
            
            print(f"Successfully saved {table_name} as Delta table to {table_path}")
            print(f"   Records: {dataframe.count():,}")
            return True, "delta"
            
        except Exception as e:
            print(f"Failed to save {table_name} as Delta: {e}")
            print("Falling back to Parquet format...")
    
    # Fallback to Parquet
    try:
        dataframe.write \
            .mode("overwrite") \
            .parquet(table_path + "_parquet")
        
        print(f"Successfully saved {table_name} as Parquet to {table_path}_parquet")
        print(f"   Records: {dataframe.count():,}")
        return True, "parquet"
        
    except Exception as e:
        print(f"Failed to save {table_name}: {e}")
        return False, "failed"

# Save gold tables
print("Saving gold tables...")

# Save sustainability summary
success1, format1 = save_gold_table(
    sustainability_summary, 
    "gold_land_sustainability_summary", 
    GOLD_PATH, 
    delta_available
)

# Save trends visualization
success2, format2 = save_gold_table(
    land_trends_viz, 
    "gold_land_trends_visualization", 
    GOLD_PATH, 
    delta_available
)

print("\nSave Results:")
print(f"  Sustainability Summary: {'Success' if success1 else 'Failed'} ({format1})")
print(f"  Trends Visualization: {'Success' if success2 else 'Failed'} ({format2})")

Saving gold tables...
Successfully saved gold_land_sustainability_summary as Delta table to /home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold/gold_land_sustainability_summary
   Records: 43
Successfully saved gold_land_trends_visualization as Delta table to /home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold/gold_land_trends_visualization
   Records: 244

Save Results:
  Sustainability Summary: Success (delta)
  Trends Visualization: Success (delta)


## Step 8: Create Processing Report

In [16]:
# Generate processing report
processing_report = {
    "processing_timestamp": datetime.now().isoformat(),
    "notebook_name": "gold_agriculture_land.ipynb",
    "processing_type": "agricultural_land_gold_layer",
    "delta_lake_available": delta_available,
    
    "input_data": {
        "source_table": "fact_agricultural_land",
        "date_range": f"{min_year}-{max_year}" if min_year and max_year else "Unknown",
        "total_years": max_year - min_year + 1 if min_year and max_year else 0,
        "input_records": fact_agricultural_land.count()
    },
    
    "output_tables": {
        "gold_land_sustainability_summary": {
            "path": f"{GOLD_PATH}/gold_land_sustainability_summary",
            "records": sustainability_summary.count(),
            "format": format1,
            "success": success1,
            "purpose": "Annual land sustainability dashboard"
        },
        "gold_land_trends_visualization": {
            "path": f"{GOLD_PATH}/gold_land_trends_visualization",
            "records": land_trends_viz.count(),
            "format": format2,
            "success": success2,
            "purpose": "Chart-optimized land use trends"
        }
    },
    
    "data_quality": {
        "null_checks": null_checks,
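        # One summary row per year means the record count equals the span of years
        "year_range_complete": bool(min_year and max_year) and total_records == max_year - min_year + 1,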
        "total_land_types": land_types.count()
    },
    
    "business_value": [
        f"{max_year - min_year + 1 if min_year and max_year else 'Multiple'}-year historical land use trend analysis",
        "Sustainability scoring and rating system",
        "Chart-ready visualization data",
        "Forest coverage and deforestation tracking",
        "Agricultural land productivity insights",
        "Policy impact assessment capabilities"
    ],
    
    "status": "completed"
}

# Save report as JSON
import json
report_path = f"{GOLD_PATH}/agricultural_land_processing_report.json"
with open(report_path, 'w') as f:
    json.dump(processing_report, f, indent=2)

print("Processing Report:")
print(json.dumps(processing_report, indent=2))
print(f"\nReport saved to: {report_path}")

Processing Report:
{
  "processing_timestamp": "2025-08-19T12:56:09.411130",
  "notebook_name": "gold_agriculture_land.ipynb",
  "processing_type": "agricultural_land_gold_layer",
  "delta_lake_available": true,
  "input_data": {
    "source_table": "fact_agricultural_land",
    "date_range": "1980-2022",
    "total_years": 43,
    "input_records": 244
  },
  "output_tables": {
    "gold_land_sustainability_summary": {
      "path": "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold/gold_land_sustainability_summary",
      "records": 43,
      "format": "delta",
      "success": true,
      "purpose": "Annual land sustainability dashboard"
    },
    "gold_land_trends_visualization": {
      "path": "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-gold/gold_land_trends_visualization",
      "records": 244,
      "format": "delta",
      "success": true,
      "purpose": "Chart-optimized land use trends"
    }
  },
  "data_quality": {
    "null_checks": {
      "year": 0,
     

## Step 9: Verification and Sample Queries
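
The `year_range_complete` flag in the processing report is only meaningful if it reflects the years actually present in the output. Below is a minimal sketch of a check that also reports which years are missing. It assumes the distinct years have already been collected to the driver, for example with `[r["year"] for r in sustainability_summary.select("year").distinct().collect()]`. The helper name `find_missing_years` and the sample values are hypothetical and not part of the notebook.

```python
def find_missing_years(years, min_year, max_year):
    """Return the sorted years in [min_year, max_year] that are absent from `years`."""
    present = set(years)
    return [y for y in range(min_year, max_year + 1) if y not in present]


# Illustrative values only, not taken from the dataset
observed = [1980, 1981, 1983, 1984]
missing = find_missing_years(observed, 1980, 1984)
year_range_complete = not missing
print(missing, year_range_complete)
```

An empty list means the range is complete, so the same call can feed both the boolean flag and a more detailed entry in `data_quality` if one is wanted.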

In [17]:
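# Verification helper: read a gold table back, preferring Delta when it was written as Delta
def load_gold_table(table_name, gold_path, expected_format):
    """Load a gold table, falling back from Delta to the Parquet copies."""
    if expected_format == "delta":
        try:
            return spark.read.format("delta").load(f"{gold_path}/{table_name}")
        except Exception:
            # Catch Exception rather than using a bare except, so Ctrl+C still interrupts
            print(f"Failed to load {table_name} as Delta, trying Parquet...")
    
    # Try the Parquet backup path first, then the plain table path
    try:
        return spark.read.parquet(f"{gold_path}/{table_name}_parquet")
    except Exception:
        return spark.read.parquet(f"{gold_path}/{table_name}")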

# Verify saved tables
print("Verification - Reading saved gold tables:")

try:
    test_sustainability = load_gold_table("gold_land_sustainability_summary", GOLD_PATH, format1)
    print(f"gold_land_sustainability_summary: {test_sustainability.count():,} records")
    
    test_trends = load_gold_table("gold_land_trends_visualization", GOLD_PATH, format2)
    print(f"gold_land_trends_visualization: {test_trends.count():,} records")
    
    print("\nSample Dashboard Queries:")
    
    # Query 1: Recent sustainability trends
    print("\n1. Recent Sustainability Trends (Last 8 years):")
    test_sustainability.filter(col("year") >= max_year - 7).select(
        "year", "forest_coverage_pct", "agricultural_land_pct", 
        "land_sustainability_score", "sustainability_rating"
    ).orderBy(desc("year")).show(8)
    
    # Query 2: Forest coverage trends
    print("\n2. Forest Coverage Chart Data (Last 10 years):")
    forest_data = test_trends.filter(
        (col("land_use_type") == "Forest") & (col("year") >= max_year - 9)
    )
    
    if forest_data.count() > 0:
        forest_data.select(
            "year", "land_percentage", "trend_line_value", 
            "trend_arrow", "chart_label"
        ).orderBy("year").show(10, truncate=False)
    else:
        print("No forest data available for recent years")
    
    # Query 3: Latest land use composition
    print(f"\n3. Land Use Composition ({max_year}):")
    latest_composition = test_trends.filter(col("year") == max_year)
    
    if latest_composition.count() > 0:
        latest_composition.select(
            "land_use_type", "land_percentage", "chart_color_hex", "sustainability_category"
        ).orderBy(desc("land_percentage")).show()
    else:
        print(f"No data available for {max_year}")

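except Exception as e:
    print(f"Error during verification: {str(e)}")
else:
    # Report success only when every read and query above ran without error
    print("\nGold layer agricultural land trends implementation completed successfully!")
    print(f"Tables saved in {format1.upper()} (summary) and {format2.upper()} (trends) format")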

Verification - Reading saved gold tables:
gold_land_sustainability_summary: 43 records
gold_land_trends_visualization: 244 records

Sample Dashboard Queries:

1. Recent Sustainability Trends (Last 8 years):
+----+-------------------+---------------------+-------------------------+---------------------+
|year|forest_coverage_pct|agricultural_land_pct|land_sustainability_score|sustainability_rating|
+----+-------------------+---------------------+-------------------------+---------------------+
|2022| 48.449999999999996|                23.25|       -75.79676470588235|                 Poor|
|2021|               48.2|                23.25|       -75.89676470588235|                 Poor|
|2020|               48.0|                23.25|       -75.97676470588235|                 Poor|
|2019|              47.75|                 23.2|       -75.70509433962265|                 Poor|
|2018|              47.55|                23.15|       -75.41343971631204|                 Poor|
|2017|           

In [18]:
# Stop Spark session
spark.stop()
print("Spark session stopped")

Spark session stopped
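
The fallback order used by `load_gold_table` in Step 9 (Delta first, then the `_parquet` backup, then the plain Parquet path) can also be written as an ordered list of sources instead of nested `try` blocks. Below is a minimal, Spark-free sketch of that pattern. The names `load_first_available`, `failing_delta`, and `parquet_backup` are hypothetical stand-ins for the `spark.read` calls, not part of the notebook.

```python
def load_first_available(loaders):
    """Try each (label, loader) pair in order; return (label, result) of the first that succeeds."""
    errors = []
    for label, loader in loaders:
        try:
            return label, loader()
        except Exception as exc:  # a failed source moves us on to the next one
            errors.append(f"{label}: {exc}")
    raise RuntimeError("All sources failed -> " + "; ".join(errors))


# Stub loaders standing in for spark.read calls (hypothetical, for illustration)
def failing_delta():
    raise IOError("not a Delta table")


def parquet_backup():
    return ["row1", "row2"]


source, rows = load_first_available([
    ("delta", failing_delta),
    ("parquet_backup", parquet_backup),
])
print(source, len(rows))
```

Returning the label alongside the data lets the caller record which format was actually read. The collected error messages are raised only when every source fails.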
