# Time Dimension Processor

Creates the time dimension table for the Philippine socioeconomic data medallion architecture.
Generates one record per calendar month from 1980 to 2030 (612 rows) with Philippine fiscal year support (April to March, as configured in this notebook).

**Output**: `dim_time` with calendar and fiscal year hierarchies

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
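# Wildcard import: shadows Python built-ins such as min, max, sum, and round with Spark column functions
from pyspark.sql.functions import *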
from pyspark.sql.types import *
import json
from datetime import datetime, date

In [2]:
# Initialize Spark Session with proper Delta Lake configuration
spark = SparkSession.builder \
    .appName("TimeDimensionProcessor") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Application: {spark.sparkContext.appName}")
print(f"Delta Lake support: {spark.conf.get('spark.sql.extensions')}")


Spark Version: 3.4.0
Application: TimeDimensionProcessor
Delta Lake support: io.delta.sql.DeltaSparkSessionExtension


In [3]:
# Configuration
SILVER_PATH = "/home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver"
PROCESSING_TIMESTAMP = datetime.now()

os.makedirs(SILVER_PATH, exist_ok=True)

print(f"Silver Path: {SILVER_PATH}")
print(f"Processing Time: {PROCESSING_TIMESTAMP}")

Silver Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver
Processing Time: 2025-08-18 21:35:52.133873


## Generate Time Dimension Data

In [4]:
def generate_comprehensive_time_dimension(start_year=1980, end_year=2030):
    """Generate one time dimension record per month, with calendar and April-to-March fiscal attributes."""
    time_records = []
    date_id = 1
    
    print(f"Generating time dimension from {start_year} to {end_year}...")
    print(f"Philippine fiscal year: April to March")
    
    # Month and quarter mappings
    month_names = [
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    ]
    
    quarter_names = ["Q1", "Q2", "Q3", "Q4"]
    
    # Fiscal quarter names
    fiscal_quarter_names = ["FQ1", "FQ2", "FQ3", "FQ4"]
    
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # First day of each month as the representative date
            date_value = date(year, month, 1)
            
            # Calendar quarter (standard)
            quarter = (month - 1) // 3 + 1
            
            # Philippine fiscal year calculation (April to March)
            if month >= 4:  # April to December: fiscal year matches the calendar year
                fiscal_year = year
                fiscal_quarter = (month - 4) // 3 + 1
                fiscal_month = month - 3  # April = 1
            else:  # January to March: final quarter of the previous fiscal year
                fiscal_year = year - 1
                fiscal_quarter = 4
                fiscal_month = month + 9  # March = 12
            
            # Leap year calculation
            is_leap_year = (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0)
            
            # Determine season (simplified three-season Philippine climate calendar)
            if month in [12, 1, 2]:
                season = "Dry Season"  # Philippines dry season
            elif month in [3, 4, 5]:
                season = "Hot Dry Season"
            else:  # June to November
                season = "Wet Season"  # Philippines rainy season
            
            # Week of year for the first of the month (rough approximation:
            # assumes 4 weeks per month, so values run from 1 to 45)
            week_of_year = ((month - 1) * 4) + 1
            
            time_records.append({
                'date_id': date_id,
                'date_value': date_value,
                'year': year,
                'quarter': quarter,
                'month': month,
                'month_name': month_names[month - 1],
                'quarter_name': quarter_names[quarter - 1],
                'fiscal_year': fiscal_year,
                'fiscal_quarter': fiscal_quarter,
                'fiscal_quarter_name': fiscal_quarter_names[fiscal_quarter - 1],
                'fiscal_month': fiscal_month,
                'is_leap_year': is_leap_year,
                'season': season,
                'week_of_year': week_of_year,
                # Compare against the single run timestamp so every row uses the same reference time
                'is_current_month': (year == PROCESSING_TIMESTAMP.year and month == PROCESSING_TIMESTAMP.month),
                'created_at': PROCESSING_TIMESTAMP
            })
            
            date_id += 1
            
        # Progress indicator
        if (year - start_year + 1) % 10 == 0:
            print(f"Generated data for {year} ({year - start_year + 1}/{end_year - start_year + 1} years)")
    
    print(f"Completed: Generated {len(time_records):,} time records")
    return time_records

time_data = generate_comprehensive_time_dimension()
print(f"\nTotal time records generated: {len(time_data):,}")

Generating time dimension from 1980 to 2030...
Philippine fiscal year: April to March
Generated data for 1989 (10/51 years)
Generated data for 1999 (20/51 years)
Generated data for 2009 (30/51 years)
Generated data for 2019 (40/51 years)
Generated data for 2029 (50/51 years)
Completed: Generated 612 time records

Total time records generated: 612


In [5]:
# Show sample data for validation
print("Sample time records:")
for i, record in enumerate(time_data[:12]):  # Show first 12 months
    print(f"{record['year']}-{record['month']:02d}: {record['month_name'][:3]} | "
          f"Q{record['quarter']} | FY{record['fiscal_year']} FQ{record['fiscal_quarter']} | "
          f"{record['season']}")

print("\n...")

# Show fiscal year transition example
print("\nFiscal year transition example (2023-2024):")
transition_records = [r for r in time_data if r['year'] in [2023, 2024] and r['month'] in [3, 4, 5]]
for record in transition_records[:6]:
    print(f"{record['year']}-{record['month']:02d} {record['month_name']}: "
          f"Calendar Q{record['quarter']} | Fiscal FY{record['fiscal_year']} FQ{record['fiscal_quarter']} "
          f"({record['fiscal_quarter_name']})")

Sample time records:
1980-01: Jan | Q1 | FY1979 FQ4 | Dry Season
1980-02: Feb | Q1 | FY1979 FQ4 | Dry Season
1980-03: Mar | Q1 | FY1979 FQ4 | Hot Dry Season
1980-04: Apr | Q2 | FY1980 FQ1 | Hot Dry Season
1980-05: May | Q2 | FY1980 FQ1 | Hot Dry Season
1980-06: Jun | Q2 | FY1980 FQ1 | Wet Season
1980-07: Jul | Q3 | FY1980 FQ2 | Wet Season
1980-08: Aug | Q3 | FY1980 FQ2 | Wet Season
1980-09: Sep | Q3 | FY1980 FQ2 | Wet Season
1980-10: Oct | Q4 | FY1980 FQ3 | Wet Season
1980-11: Nov | Q4 | FY1980 FQ3 | Wet Season
1980-12: Dec | Q4 | FY1980 FQ3 | Dry Season

...

Fiscal year transition example (2023-2024):
2023-03 March: Calendar Q1 | Fiscal FY2022 FQ4 (FQ4)
2023-04 April: Calendar Q2 | Fiscal FY2023 FQ1 (FQ1)
2023-05 May: Calendar Q2 | Fiscal FY2023 FQ1 (FQ1)
2024-03 March: Calendar Q1 | Fiscal FY2023 FQ4 (FQ4)
2024-04 April: Calendar Q2 | Fiscal FY2024 FQ1 (FQ1)
2024-05 May: Calendar Q2 | Fiscal FY2024 FQ1 (FQ1)
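
The branch logic used for the fiscal attributes can also be written as modular arithmetic, which makes the April-to-March mapping easier to test in isolation. The following is a minimal sketch, not the notebook's own code; `fiscal_period` is a hypothetical helper name introduced here for illustration.

```python
def fiscal_period(year: int, month: int) -> tuple[int, int, int]:
    """Return (fiscal_year, fiscal_quarter, fiscal_month) for an April-to-March fiscal year."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Shift the calendar so that April becomes fiscal month 1 and March becomes 12.
    fiscal_month = (month - 4) % 12 + 1
    fiscal_quarter = (fiscal_month - 1) // 3 + 1
    # January to March belong to the fiscal year that started the previous April.
    fiscal_year = year if month >= 4 else year - 1
    return fiscal_year, fiscal_quarter, fiscal_month


# Print the same transition months shown in the output above.
for y, m in [(2023, 3), (2023, 4), (2024, 3), (2024, 4)]:
    fy, fq, fm = fiscal_period(y, m)
    print(f"{y}-{m:02d}: FY{fy} FQ{fq} fiscal month {fm}")
```

Keeping the mapping in one small function means the generator and any validation step can share a single definition of the fiscal calendar.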


## Create Time Dimension DataFrame

In [6]:
# Create time dimension DataFrame
time_df = spark.createDataFrame(time_data)

print(f"Time dimension DataFrame created: {time_df.count():,} records")

# Show schema
print("\nTime dimension schema:")
time_df.printSchema()

# Show sample data
print("\nSample time dimension records:")
time_df.filter((col("year") >= 2023) & (col("year") <= 2024)).show(12, truncate=False)

                                                                                

Time dimension DataFrame created: 612 records

Time dimension schema:
root
 |-- created_at: timestamp (nullable = true)
 |-- date_id: long (nullable = true)
 |-- date_value: date (nullable = true)
 |-- fiscal_month: long (nullable = true)
 |-- fiscal_quarter: long (nullable = true)
 |-- fiscal_quarter_name: string (nullable = true)
 |-- fiscal_year: long (nullable = true)
 |-- is_current_month: boolean (nullable = true)
 |-- is_leap_year: boolean (nullable = true)
 |-- month: long (nullable = true)
 |-- month_name: string (nullable = true)
 |-- quarter: long (nullable = true)
 |-- quarter_name: string (nullable = true)
 |-- season: string (nullable = true)
 |-- week_of_year: long (nullable = true)
 |-- year: long (nullable = true)


Sample time dimension records:
+--------------------------+-------+----------+------------+--------------+-------------------+-----------+----------------+------------+-----+----------+-------+------------+--------------+------------+----+
|created_at      
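
The printed schema lists columns alphabetically because `createDataFrame` infers it from a list of dicts. One way to control column order and types is to pass ordered tuples together with a DDL schema string. The sketch below builds both in plain Python; `DIM_TIME_COLUMNS`, `DIM_TIME_DDL`, and `to_ordered_rows` are hypothetical names, and the column order simply follows the dict built in the generator.

```python
from datetime import date, datetime

# Column order and Spark SQL types, written by hand to match the generator's dict keys.
DIM_TIME_COLUMNS = [
    ("date_id", "BIGINT"), ("date_value", "DATE"), ("year", "BIGINT"),
    ("quarter", "BIGINT"), ("month", "BIGINT"), ("month_name", "STRING"),
    ("quarter_name", "STRING"), ("fiscal_year", "BIGINT"),
    ("fiscal_quarter", "BIGINT"), ("fiscal_quarter_name", "STRING"),
    ("fiscal_month", "BIGINT"), ("is_leap_year", "BOOLEAN"),
    ("season", "STRING"), ("week_of_year", "BIGINT"),
    ("is_current_month", "BOOLEAN"), ("created_at", "TIMESTAMP"),
]

# DDL-style schema string, e.g. "date_id BIGINT, date_value DATE, ..."
DIM_TIME_DDL = ", ".join(f"{name} {sql_type}" for name, sql_type in DIM_TIME_COLUMNS)


def to_ordered_rows(records):
    """Convert dict records into tuples that follow DIM_TIME_COLUMNS order."""
    names = [name for name, _ in DIM_TIME_COLUMNS]
    return [tuple(record[name] for name in names) for record in records]


# Illustrative record shaped like one entry of time_data.
sample = {
    "date_id": 1, "date_value": date(1980, 1, 1), "year": 1980, "quarter": 1,
    "month": 1, "month_name": "January", "quarter_name": "Q1",
    "fiscal_year": 1979, "fiscal_quarter": 4, "fiscal_quarter_name": "FQ4",
    "fiscal_month": 10, "is_leap_year": True, "season": "Dry Season",
    "week_of_year": 1, "is_current_month": False,
    "created_at": datetime(2025, 8, 18, 21, 35, 52),
}
rows = to_ordered_rows([sample])
print(DIM_TIME_DDL)
print(rows[0][:3])
```

Under these assumptions the DataFrame could then be created with `spark.createDataFrame(to_ordered_rows(time_data), schema=DIM_TIME_DDL)`, which skips schema inference and keeps `date_id` as the first column.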

In [7]:
# Add some analytical columns
time_df_enhanced = time_df.withColumn(
    "year_month", 
    concat(col("year").cast("string"), lit("-"), 
           lpad(col("month").cast("string"), 2, "0"))
).withColumn(
    "fiscal_year_quarter",
    concat(lit("FY"), col("fiscal_year").cast("string"), lit(" "), col("fiscal_quarter_name"))
).withColumn(
    "calendar_year_quarter",
    concat(col("year").cast("string"), lit(" "), col("quarter_name"))
).withColumn(
    "is_fiscal_year_start",
    (col("month") == 4).cast("boolean")
).withColumn(
    "is_fiscal_year_end",
    (col("month") == 3).cast("boolean")
).withColumn(
    "days_in_month",
    when(col("month") == 2, 
         when(col("is_leap_year"), 29).otherwise(28))
    .when(col("month").isin([4, 6, 9, 11]), 30)
    .otherwise(31)
)

print("Enhanced time dimension with analytical columns")
time_df_enhanced.select(
    "date_id", "year_month", "fiscal_year_quarter", "calendar_year_quarter", 
    "season", "is_fiscal_year_start", "is_fiscal_year_end", "days_in_month"
).show(10, truncate=False)

Enhanced time dimension with analytical columns
+-------+----------+-------------------+---------------------+--------------+--------------------+------------------+-------------+
|date_id|year_month|fiscal_year_quarter|calendar_year_quarter|season        |is_fiscal_year_start|is_fiscal_year_end|days_in_month|
+-------+----------+-------------------+---------------------+--------------+--------------------+------------------+-------------+
|1      |1980-01   |FY1979 FQ4         |1980 Q1              |Dry Season    |false               |false             |31           |
|2      |1980-02   |FY1979 FQ4         |1980 Q1              |Dry Season    |false               |false             |29           |
|3      |1980-03   |FY1979 FQ4         |1980 Q1              |Hot Dry Season|false               |true              |31           |
|4      |1980-04   |FY1980 FQ1         |1980 Q2              |Hot Dry Season|true                |false             |30           |
|5      |1980-05   |FY1980 F
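
The `days_in_month` expression encodes the month-length rule by hand. As a quick sanity check, the same rule can be compared against the standard library's `calendar.monthrange` over the full 1980 to 2030 range. This is an illustrative cross-check only; `days_in_month_rule` is a hypothetical pure-Python restatement of the Spark expression.

```python
import calendar


def days_in_month_rule(year: int, month: int) -> int:
    """Pure-Python restatement of the when/otherwise rule used for days_in_month."""
    is_leap = (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0)
    if month == 2:
        return 29 if is_leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


# Collect every (year, month) where the hand-written rule disagrees with the stdlib.
mismatches = [
    (y, m)
    for y in range(1980, 2031)
    for m in range(1, 13)
    if days_in_month_rule(y, m) != calendar.monthrange(y, m)[1]
]
print(f"Mismatches: {len(mismatches)}")
```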

## Data Quality Validation

In [8]:
# Data quality validation
print("Data Quality Validation:")
print(f"Total records: {time_df_enhanced.count():,}")
print(f"Expected records (51 years × 12 months): {51 * 12:,}")
print(f"Date range: {time_df_enhanced.agg(min('year')).collect()[0][0]} to {time_df_enhanced.agg(max('year')).collect()[0][0]}")

# Check for duplicates
duplicate_count = time_df_enhanced.groupBy("year", "month").count().filter(col("count") > 1).count()
print(f"Duplicate year-month combinations: {duplicate_count}")

# Check fiscal year logic
print("\nFiscal Year Logic Validation:")
fy_validation = time_df_enhanced.filter(col("year") == 2023).select(
    "month", "month_name", "fiscal_year", "fiscal_quarter", "fiscal_month"
).orderBy("month")

print("2023 Calendar Year fiscal mappings:")
fy_validation.show(12, truncate=False)

# Fiscal year transition validation
print("\nFiscal Year Transition Validation:")
transition_check = time_df_enhanced.filter(
    ((col("year") == 2023) & (col("month") == 3)) |
    ((col("year") == 2023) & (col("month") == 4))
).select("year", "month", "month_name", "fiscal_year", "fiscal_quarter", "is_fiscal_year_end", "is_fiscal_year_start")

transition_check.show(truncate=False)

Data Quality Validation:
Total records: 612
Expected records (51 years × 12 months): 612
Date range: 1980 to 2030
Duplicate year-month combinations: 0

Fiscal Year Logic Validation:
2023 Calendar Year fiscal mappings:
+-----+----------+-----------+--------------+------------+
|month|month_name|fiscal_year|fiscal_quarter|fiscal_month|
+-----+----------+-----------+--------------+------------+
|1    |January   |2022       |4             |10          |
|2    |February  |2022       |4             |11          |
|3    |March     |2022       |4             |12          |
|4    |April     |2023       |1             |1           |
|5    |May       |2023       |1             |2           |
|6    |June      |2023       |1             |3           |
|7    |July      |2023       |2             |4           |
|8    |August    |2023       |2             |5           |
|9    |September |2023       |2             |6           |
|10   |October   |2023       |3             |7           |
|11   |November
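
The fiscal mappings above are checked by eye. A programmatic check could make the result reproducible and give the summary report a computed flag instead of a fixed one. The sketch below is one possible approach, assuming records shaped like the dicts in `time_data`; `validate_fiscal_records` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict


def validate_fiscal_records(records):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    months_by_fiscal_year = defaultdict(list)
    for r in records:
        # Recompute the expected April-to-March attributes independently.
        expected_fm = (r["month"] - 4) % 12 + 1
        expected_fq = (expected_fm - 1) // 3 + 1
        expected_fy = r["year"] if r["month"] >= 4 else r["year"] - 1
        actual = (r["fiscal_year"], r["fiscal_quarter"], r["fiscal_month"])
        if actual != (expected_fy, expected_fq, expected_fm):
            problems.append(f"{r['year']}-{r['month']:02d}: unexpected fiscal attributes {actual}")
        months_by_fiscal_year[r["fiscal_year"]].append(r["fiscal_month"])
    # No fiscal year should contain the same fiscal month twice.
    for fy, fiscal_months in sorted(months_by_fiscal_year.items()):
        if len(set(fiscal_months)) != len(fiscal_months):
            problems.append(f"FY{fy}: duplicate fiscal months")
    return problems


good = {"year": 2023, "month": 3, "fiscal_year": 2022, "fiscal_quarter": 4, "fiscal_month": 12}
bad = dict(good, fiscal_year=2023)  # March wrongly assigned to the new fiscal year
print(validate_fiscal_records([good]))
print(validate_fiscal_records([bad]))
```

With a helper like this, the quality flag could be derived as `len(validate_fiscal_records(time_data)) == 0`.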

## Save Time Dimension

In [9]:
# Save time dimension
dim_time_path = os.path.join(SILVER_PATH, "dim_time")

try:
    time_df_enhanced.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save(dim_time_path)
    
    print(f"\nTime dimension saved successfully!")
    print(f"Path: {dim_time_path}")
    print(f"Records: {time_df_enhanced.count():,}")
    
except Exception as e:
    print(f"Error saving time dimension: {e}")
    # Try saving as parquet if delta fails
    try:
        parquet_path = dim_time_path + "_parquet"
        time_df_enhanced.write.format("parquet").mode("overwrite").save(parquet_path)
        print(f"Saved as parquet instead: {parquet_path}")
    except Exception as e2:
        print(f"Failed to save as parquet too: {e2}")
        raise

                                                                                


Time dimension saved successfully!
Path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_time
Records: 612


## Generate Time Dimension Statistics

In [10]:
# Generate comprehensive statistics
print("Time Dimension Statistics:")
print("\n1. Year Distribution:")
year_stats = time_df_enhanced.groupBy("year").count().orderBy("year")
year_stats.filter((col("year") % 5 == 0) | (col("year").isin([1980, 2024, 2030]))).show()

print("\n2. Fiscal Year Distribution:")
fy_stats = time_df_enhanced.groupBy("fiscal_year").count().orderBy("fiscal_year")
fy_stats.filter((col("fiscal_year") % 5 == 0) | (col("fiscal_year").isin([1979, 2023, 2029]))).show()

print("\n3. Season Distribution:")
time_df_enhanced.groupBy("season").count().show()

print("\n4. Leap Year Count:")
leap_year_count = time_df_enhanced.filter(col("is_leap_year") == True).select("year").distinct().count()
total_years = time_df_enhanced.select("year").distinct().count()
print(f"Leap years: {leap_year_count} out of {total_years} years")

print("\n5. Fiscal Quarter Distribution:")
time_df_enhanced.groupBy("fiscal_quarter", "fiscal_quarter_name").count().orderBy("fiscal_quarter").show()

print("\n6. Key Date Flags:")
fiscal_year_starts = time_df_enhanced.filter(col("is_fiscal_year_start") == True).count()
fiscal_year_ends = time_df_enhanced.filter(col("is_fiscal_year_end") == True).count()
current_months = time_df_enhanced.filter(col("is_current_month") == True).count()

print(f"Fiscal year starts (April): {fiscal_year_starts}")
print(f"Fiscal year ends (March): {fiscal_year_ends}")
print(f"Current month flags: {current_months}")

Time Dimension Statistics:

1. Year Distribution:
+----+-----+
|year|count|
+----+-----+
|1980|   12|
|1985|   12|
|1990|   12|
|1995|   12|
|2000|   12|
|2005|   12|
|2010|   12|
|2015|   12|
|2020|   12|
|2024|   12|
|2025|   12|
|2030|   12|
+----+-----+


2. Fiscal Year Distribution:
+-----------+-----+
|fiscal_year|count|
+-----------+-----+
|       1979|    3|
|       1980|   12|
|       1985|   12|
|       1990|   12|
|       1995|   12|
|       2000|   12|
|       2005|   12|
|       2010|   12|
|       2015|   12|
|       2020|   12|
|       2023|   12|
|       2025|   12|
|       2029|   12|
|       2030|    9|
+-----------+-----+


3. Season Distribution:
+--------------+-----+
|        season|count|
+--------------+-----+
|    Wet Season|  306|
|    Dry Season|  153|
|Hot Dry Season|  153|
+--------------+-----+


4. Leap Year Count:
Leap years: 13 out of 51 years

5. Fiscal Quarter Distribution:
+--------------+-------------------+-----+
|fiscal_quarter|fiscal_quarter_name
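
The counts reported above can be predicted without Spark, which gives an independent reference for the statistics. The sketch below recomputes the expected leap-year, season, and fiscal-year counts with the standard library only. The names are illustrative, and the season rule restates the one used in the generator.

```python
import calendar
from collections import Counter

START_YEAR, END_YEAR = 1980, 2030
months = [(y, m) for y in range(START_YEAR, END_YEAR + 1) for m in range(1, 13)]

# Number of leap years in the covered range.
leap_years = sum(calendar.isleap(y) for y in range(START_YEAR, END_YEAR + 1))


def season_of(month: int) -> str:
    """Same three-way season split as the generator."""
    if month in (12, 1, 2):
        return "Dry Season"
    if month in (3, 4, 5):
        return "Hot Dry Season"
    return "Wet Season"


season_counts = Counter(season_of(m) for _, m in months)

# Months per fiscal year; the first and last fiscal years are partial.
fiscal_year_counts = Counter(y if m >= 4 else y - 1 for y, m in months)

print(f"Leap years: {leap_years}")
print(dict(season_counts))
print(f"FY1979: {fiscal_year_counts[1979]} months, FY2030: {fiscal_year_counts[2030]} months")
```

The partial fiscal years at both ends (FY1979 and FY2030) are a direct consequence of generating whole calendar years, so any fact table aggregated by fiscal year should treat those two as incomplete.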

In [11]:
# Final validation
try:
    # Validate the saved table
    test_df = spark.read.format("delta").load(dim_time_path)
    count = test_df.count()
    print(f"\nValidation: Successfully created dim_time with {count:,} records")
    
    # Show sample data with key columns
    print("\nSample data validation:")
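    # Filter on "year" before projecting it away, and order rows for a stable display
    test_df.filter(col("year") == 2024).select(
        "date_id", "date_value", "year_month", "fiscal_year_quarter", 
        "season", "is_leap_year", "days_in_month"
    ).orderBy("date_id").show(12, truncate=False)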
    
except Exception as e:
    print(f"Validation failed: {e}")


Validation: Successfully created dim_time with 612 records

Sample data validation:
+-------+----------+----------+-------------------+--------------+------------+-------------+
|date_id|date_value|year_month|fiscal_year_quarter|season        |is_leap_year|days_in_month|
+-------+----------+----------+-------------------+--------------+------------+-------------+
|529    |2024-01-01|2024-01   |FY2023 FQ4         |Dry Season    |true        |31           |
|530    |2024-02-01|2024-02   |FY2023 FQ4         |Dry Season    |true        |29           |
|531    |2024-03-01|2024-03   |FY2023 FQ4         |Hot Dry Season|true        |31           |
|532    |2024-04-01|2024-04   |FY2024 FQ1         |Hot Dry Season|true        |30           |
|533    |2024-05-01|2024-05   |FY2024 FQ1         |Hot Dry Season|true        |31           |
|534    |2024-06-01|2024-06   |FY2024 FQ1         |Wet Season    |true        |30           |
|535    |2024-07-01|2024-07   |FY2024 FQ2         |Wet Season    |tru
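
Because `date_id` is assigned sequentially starting at 1 for January 1980, the key for any month in range follows from simple arithmetic, which can be handy when stamping fact rows before a join. The sketch below assumes that sequential assignment and the default 1980 to 2030 range; `date_id_for` is a hypothetical helper. Joining on `year` and `month` stays the safer choice if the range or numbering ever changes.

```python
BASE_YEAR = 1980  # assumption: first year in dim_time (the generator's default start_year)
LAST_YEAR = 2030  # assumption: last year in dim_time (the generator's default end_year)


def date_id_for(year: int, month: int) -> int:
    """Return the dim_time surrogate key for a calendar month, assuming sequential IDs."""
    if not (BASE_YEAR <= year <= LAST_YEAR and 1 <= month <= 12):
        raise ValueError(f"{year}-{month:02d} is outside the dim_time range")
    return (year - BASE_YEAR) * 12 + month


# January 2024 appears with date_id 529 in the validation output above.
print(date_id_for(2024, 1))
```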

In [12]:
# Create summary report
summary_report = {
    'processing_timestamp': PROCESSING_TIMESTAMP.isoformat(),
    'dimension_type': 'time',
    'date_range': {
        'start_year': 1980,
        'end_year': 2030,
        'total_years': 51
    },
    'records_generated': len(time_data),
    'fiscal_year_type': 'Philippine (April to March)',
    'features': [
        'Calendar year/quarter/month hierarchy',
        'Philippine fiscal year support',
        'Seasonal classification for Philippines',
        'Leap year identification',
        'Fiscal year transition flags',
        'Days in month calculation',
        'Current period identification'
    ],
    'quality_checks': {
        'no_duplicates': duplicate_count == 0,
        'expected_record_count': len(time_data) == (51 * 12),
        'fiscal_year_logic_validated': True  # set manually after reviewing the fiscal mappings printed earlier
    },
    'output_path': dim_time_path,
    'status': 'completed'
}

# Save summary
summary_path = os.path.join(SILVER_PATH, "dim_time_summary.json")
with open(summary_path, 'w') as f:
    json.dump(summary_report, f, indent=2, default=str)

print(f"\nSummary report saved: {summary_path}")


Summary report saved: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_time_summary.json


In [13]:
# Summary and cleanup
print(f"\n{'='*70}")
print("TIME DIMENSION PROCESSING SUMMARY")
print(f"{'='*70}")
print(f"Processing completed: {PROCESSING_TIMESTAMP}")
print(f"Date range: 1980 to 2030 (51 years)")
print(f"Total records: {len(time_data):,}")
print(f"Fiscal year type: Philippine (April to March)")
print(f"")
print("Key features:")
for feature in summary_report['features']:
    print(f"  - {feature}")
print(f"")
print(f"Quality validation: {'PASSED' if all(summary_report['quality_checks'].values()) else 'FAILED'}")
print(f"Output path: {dim_time_path}")
print(f"")
print("Time dimension ready for fact table joins!")

# Stop Spark session
spark.stop()
print("\nSpark session stopped.")


TIME DIMENSION PROCESSING SUMMARY
Processing completed: 2025-08-18 21:35:52.133873
Date range: 1980 to 2030 (51 years)
Total records: 612
Fiscal year type: Philippine (April to March)

Key features:
  - Calendar year/quarter/month hierarchy
  - Philippine fiscal year support
  - Seasonal classification for Philippines
  - Leap year identification
  - Fiscal year transition flags
  - Days in month calculation
  - Current period identification

Quality validation: PASSED
Output path: /home/ernese/miniconda3/envs/SO/New_SO/final-spark-silver/dim_time

Time dimension ready for fact table joins!

Spark session stopped.
