# Silver Layer ETL Process

## Why I Made This Notebook
I want to clean and standardize my energy demand data so it’s ready for analysis and modeling. The bronze layer gave me raw data, but it still has issues like missing values, wrong types, and inconsistent formats. This notebook helps me fix those problems and add useful features.

## Data Cleaning and Standardization
First, I combine the `Date` and `Time` columns into a proper timestamp called `datetime`. I convert all the numeric columns (like global active power and voltage) to double, and replace any missing values ("?" or blanks) with nulls. I also change the names of the columns I keep to snake_case so they’re easier to work with. To keep the data reliable, I remove any rows with incomplete timestamps and filter out impossible values, like negative power readings. Rows where global active power is missing are dropped by the same filter, because a null never passes the `>= 0` check.
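
To make these rules concrete, here is a minimal plain-Python sketch that applies them to a single record. It is an illustration only, not the Spark code this notebook runs: the helper names `to_float_or_none` and `clean_row` and the sample rows are made up, and unlike the Spark version it also treats whitespace-only strings as blanks.

```python
from datetime import datetime

MISSING_MARKERS = {"?", ""}  # values treated as missing, per the text above


def to_float_or_none(raw):
    """Return a float, or None when the raw value is missing or not numeric."""
    if raw is None or raw.strip() in MISSING_MARKERS:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def clean_row(row):
    """Apply the cleaning rules to one raw record; return None if the row is dropped."""
    try:
        ts = datetime.strptime(f"{row['Date']} {row['Time']}", "%d/%m/%Y %H:%M:%S")
    except (KeyError, ValueError):
        return None  # incomplete or unparseable timestamp

    cleaned = {"datetime": ts}
    for name, raw in row.items():
        if name not in ("Date", "Time"):
            cleaned[name.lower()] = to_float_or_none(raw)  # snake_case name, numeric value

    power = cleaned.get("global_active_power")
    if power is None or power < 0:
        return None  # missing or negative power reading
    return cleaned


# Hypothetical sample rows, for illustration only
good = clean_row({"Date": "16/12/2006", "Time": "17:24:00",
                  "Global_active_power": "4.216", "Voltage": "?"})
no_time = clean_row({"Date": "16/12/2006", "Time": "",
                     "Global_active_power": "4.216", "Voltage": "234.8"})
negative = clean_row({"Date": "16/12/2006", "Time": "17:25:00",
                      "Global_active_power": "-1.0", "Voltage": "234.8"})
```

In this sketch `good` keeps its timestamp and power value with `voltage` set to `None`, while `no_time` and `negative` are both dropped.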

## Feature Engineering
I add new columns to make the data more useful:
* `hour_of_day` – the hour from the timestamp
* `day_of_week` – the day name (like Monday)
* `is_weekend` – true if the day is Saturday or Sunday
* `consumption_kwh` – global active power (in kW) divided by 60, which gives the energy in kWh used during each one-minute reading
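
The four features above can be sketched in plain Python for a single reading. This is an illustration under assumptions, not the notebook's Spark code: `derive_features` and the sample reading are hypothetical, and the input is assumed to be a one-minute average in kW. Note that Python's `weekday()` counts Monday as 0, whereas Spark's `dayofweek` returns 1 for Sunday and 7 for Saturday, which is why the Spark version checks for 1 and 7.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def derive_features(ts, global_active_power_kw):
    """Compute the four derived columns for one reading."""
    weekday = ts.weekday()  # Python convention: Monday = 0 ... Sunday = 6
    return {
        "hour_of_day": ts.hour,
        "day_of_week": DAY_NAMES[weekday],
        "is_weekend": weekday >= 5,  # Saturday or Sunday
        # kW sustained for one minute = kW * (1/60) h = kWh
        "consumption_kwh": global_active_power_kw / 60,
    }


# Hypothetical reading: 6.0 kW at 17:24 on 16 December 2006
features = derive_features(datetime(2006, 12, 16, 17, 24), 6.0)
weekday_features = derive_features(datetime(2006, 12, 18, 9, 0), 1.2)
```

Dividing by 60 only makes sense when each row covers exactly one minute; a different sampling interval would need a different divisor.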

## Output Table
After all these steps, I save the cleaned and enriched data to the silver layer as `silver_power_consumption`. The table is then consistent, easy to use, and ready for deeper analysis or machine learning in the next steps of my project.

In [0]:
%run ../config/setup

In [0]:
from pyspark.sql.functions import col, to_timestamp, concat_ws, when, dayofweek, date_format
from pyspark.sql.types import DoubleType


# Read from the bronze layer
df = spark.table(full_path_bronze)

# Numeric columns to convert to double
numeric_cols = ["Global_active_power", "Global_reactive_power", "Voltage",
                "Global_intensity", "Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]

In [0]:
# Treat "?" and empty strings as missing, then cast each column to double
for c in numeric_cols:
    is_missing = (col(c) == "?") | (col(c) == "") | col(c).isNull()
    df = df.withColumn(c, when(is_missing, None).otherwise(col(c)).cast(DoubleType()))

df_silver = (
    df
    # Build the timestamp from the Date and Time strings (day/month/year order)
    .withColumn("datetime", to_timestamp(concat_ws(" ", col("Date"), col("Time")), "d/M/y H:m:s"))
    # Remove rows with incomplete or unparseable timestamps
    .filter(col("datetime").isNotNull())
    # Filter impossible values (negative power readings); null readings are dropped too
    .filter(col("Global_active_power") >= 0)
    # Rename the kept columns to snake_case
    .withColumnRenamed("Global_active_power", "global_active_power")
    .withColumnRenamed("Voltage", "voltage")
    .withColumnRenamed("Global_intensity", "global_intensity")
    .withColumnRenamed("Sub_metering_1", "sub_metering_1")
    .withColumnRenamed("Sub_metering_2", "sub_metering_2")
    .withColumnRenamed("Sub_metering_3", "sub_metering_3")
    .withColumn("hour_of_day", date_format(col("datetime"), "H").cast("int"))
    # "EEEE" gives the full day name (Monday); "E" would give the abbreviation (Mon)
    .withColumn("day_of_week", date_format(col("datetime"), "EEEE"))
    # dayofweek() returns 1 for Sunday and 7 for Saturday
    .withColumn("is_weekend", dayofweek(col("datetime")).isin([1, 7]))
    .withColumn("consumption_kwh", col("global_active_power") / 60)
    .select(
        "datetime", "hour_of_day", "day_of_week", "is_weekend", "consumption_kwh",
        "global_active_power", "voltage", "global_intensity",
        "sub_metering_1", "sub_metering_2", "sub_metering_3"
    )
)

### Write to the silver layer

In [0]:
df_silver.write.format("delta").mode("overwrite").saveAsTable(full_path_silver)
print(f"Table saved to: {full_path_silver}")