# 1. Raw Data Layer

- **Description:** Store raw data exactly as received from the source, with no transformations. This layer serves as the immutable source of truth.
- **Storage:** CSV files in a structured directory hierarchy
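
The later cells read from paths shaped like `../data/<layer>/<region>/<year>/*.parquet`. As a minimal sketch, assuming the raw layer follows the same layout under a directory named `raw` (that name and the `layer_glob` helper are hypothetical, not taken from the project), the glob patterns could be built like this:

```python
from pathlib import PurePosixPath


def layer_glob(layer: str, region: str, year: int, ext: str, root: str = "../data") -> str:
    """Build a glob pattern such as ../data/raw/NSW/2020/*.csv."""
    return str(PurePosixPath(root) / layer / region / str(year) / f"*.{ext}")


# Hypothetical raw-layer patterns for the years that appear in the notebook cells.
raw_paths = [layer_glob("raw", "NSW", year, "csv") for year in range(2020, 2025)]
print(raw_paths[0])
```

Keeping the layer, region, and year as separate path segments means every layer can reuse one helper and a single year can be loaded or reprocessed in isolation.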

# 2. Staging Data Layer

- **Description:** In this layer, data is cleaned, validated, and transformed into a consistent format. It serves as an intermediary stage before further processing.
- **Processing:** Use Apache Spark or pandas to read the CSV files, then clean and transform the data.
- **Storage:** Store the cleaned and transformed data in Parquet files for efficient querying and processing.

In [52]:
from src.data_processing.data_loader import DataLoader

file_paths = [
    "../data/staging/NSW/2020/*.parquet",
    # "../data/staging/NSW/2021/*.parquet",
    # "../data/staging/NSW/2022/*.parquet",
    # "../data/staging/NSW/2023/*.parquet",
    # "../data/staging/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)
# combined_df.printSchema()

+------+-----------+-----+----------+-------------------+-----------------+--------+-------------------+------------------+
|REGION|TOTALDEMAND|RRP  |PERIODTYPE|date               |prev_total_demand|prev_rrp|demand_diff        |rrp_diff          |
+------+-----------+-----+----------+-------------------+-----------------+--------+-------------------+------------------+
|NSW1  |8581.98    |48.87|TRADE     |2020-08-01 00:30:00|NULL             |NULL    |NULL               |NULL              |
|NSW1  |8366.86    |40.79|TRADE     |2020-08-01 01:00:00|8581.98          |48.87   |-215.11999999999898|-8.079999999999998|
|NSW1  |8134.4     |41.55|TRADE     |2020-08-01 01:30:00|8366.86          |40.79   |-232.46000000000095|0.759999999999998 |
|NSW1  |7811.92    |44.61|TRADE     |2020-08-01 02:00:00|8134.4           |41.55   |-322.47999999999956|3.0600000000000023|
|NSW1  |7469.27    |48.47|TRADE     |2020-08-01 02:30:00|7811.92          |44.61   |-342.64999999999964|3.8599999999999994|
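+------+-----------+-----+----------+-------------------+-----------------+--------+-------------------+------------------+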
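
The `prev_*` and `*_diff` columns in the output above look like one-step lag features. The sketch below shows one way to derive them with pandas. It is an inference from the output, not the project's actual transformation code, and the `add_lag_features` helper is hypothetical. The sample rows are the first three rows of the output.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add previous-interval values and their differences, per region."""
    out = df.sort_values(["REGION", "date"]).reset_index(drop=True)
    grouped = out.groupby("REGION")
    # shift(1) takes the value from the previous row within the same region.
    out["prev_total_demand"] = grouped["TOTALDEMAND"].shift(1)
    out["prev_rrp"] = grouped["RRP"].shift(1)
    out["demand_diff"] = out["TOTALDEMAND"] - out["prev_total_demand"]
    out["rrp_diff"] = out["RRP"] - out["prev_rrp"]
    return out


sample = pd.DataFrame(
    {
        "REGION": ["NSW1", "NSW1", "NSW1"],
        "TOTALDEMAND": [8581.98, 8366.86, 8134.40],
        "RRP": [48.87, 40.79, 41.55],
        "date": pd.to_datetime(
            ["2020-08-01 00:30:00", "2020-08-01 01:00:00", "2020-08-01 01:30:00"]
        ),
    }
)
staged = add_lag_features(sample)
print(staged)
```

The first row of each region has no predecessor, so its lag and difference columns are missing, which matches the `NULL` values in the first output row.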

# 3. Curated Data Layer

- **Description:** This layer stores data that has been further processed and enriched, optimized for consumption by analytics and machine learning models.
- **Processing:** Perform aggregation, normalization, and feature engineering.
- **Storage:** Store the curated data in Parquet files.

In [53]:
from src.data_processing.data_loader import DataLoader

file_paths = [
    "../data/curated/NSW/2020/*.parquet",
    # "../data/curated/NSW/2021/*.parquet",
    # "../data/curated/NSW/2022/*.parquet",
    # "../data/curated/NSW/2023/*.parquet",
    # "../data/curated/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+-------------------+----------+-------+------------+---------+------------------+
|date               |avg_demand|avg_rrp|total_demand|total_rrp|demand_rrp_ratio  |
+-------------------+----------+-------+------------+---------+------------------+
|2020-01-01 07:00:00|6125.52   |32.66  |6125.52     |32.66    |187.55419473361914|
|2020-01-01 10:00:00|6424.01   |12.92  |6424.01     |12.92    |497.2143962848297 |
|2020-01-04 03:00:00|6536.29   |45.37  |6536.29     |45.37    |144.06634339872164|
|2020-01-07 07:30:00|7809.8    |49.46  |7809.8      |49.46    |157.90133441164576|
|2020-01-09 01:30:00|6692.62   |37.5   |6692.62     |37.5     |178.46986666666666|
+-------------------+----------+-------+------------+---------+------------------+
only showing top 5 rows
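
As a rough illustration of the aggregation step, the sketch below groups by `date` and computes the columns shown above with pandas. The grouping key, the input column names (borrowed from the staging output), and the `curate` helper are assumptions. The ratio is computed as `avg_demand / avg_rrp`, which reproduces the values in the output.

```python
import pandas as pd


def curate(staged: pd.DataFrame) -> pd.DataFrame:
    """Aggregate demand and price per timestamp and add their ratio."""
    curated = staged.groupby("date", as_index=False).agg(
        avg_demand=("TOTALDEMAND", "mean"),
        avg_rrp=("RRP", "mean"),
        total_demand=("TOTALDEMAND", "sum"),
        total_rrp=("RRP", "sum"),
    )
    # Assumed definition of the ratio; it matches the displayed rows.
    curated["demand_rrp_ratio"] = curated["avg_demand"] / curated["avg_rrp"]
    return curated


# Input values taken from the first two rows of the curated output above.
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-01 07:00:00", "2020-01-01 10:00:00"]),
        "TOTALDEMAND": [6125.52, 6424.01],
        "RRP": [32.66, 12.92],
    }
)
curated = curate(sample)
print(curated)
```

In the displayed rows `avg_demand` equals `total_demand`, which suggests each group holds a single record. A ratio defined this way becomes infinite when the average price is zero, so real data may need a guard for that case.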



# 4. Analytical Data Layer

- **Description:** This layer is optimized for direct querying, anomaly detection, and consumption by BI tools and dashboards.
- **Processing:** Create views, precomputed aggregates, and indices for fast query performance.
- **Storage:** Use Parquet files for efficient querying and analytics.

In [4]:
from src.data_processing.data_loader import DataLoader

file_paths = [
    "../data/analytical/NSW/2020/*.parquet",
    # "../data/analytical/NSW/2021/*.parquet",
    # "../data/analytical/NSW/2022/*.parquet",
    # "../data/analytical/NSW/2023/*.parquet",
    # "../data/analytical/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+-------------------+------------------+---------------+--------------------+-----------------+------------------+------------------+------------------+-------------------+----------+
|date               |monthly_avg_demand|monthly_avg_rrp|monthly_total_demand|monthly_total_rrp|demand_rrp_ratio  |mean_ratio        |stddev_ratio      |z_score            |is_anomaly|
+-------------------+------------------+---------------+--------------------+-----------------+------------------+------------------+------------------+-------------------+----------+
|2020-01-01 00:30:00|7134.15           |48.84          |7134.15             |48.84            |146.0718673218673 |146.0718673218673 |NULL              |NULL               |NULL      |
|2020-01-01 01:00:00|6886.14           |50.46          |6886.14             |50.46            |136.46730083234246|141.26958407710487|6.7914540951000895|-0.7071067811865455|false     |
|2020-01-01 01:30:00|6682.01           |48.73          |6682.01             |48.
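
The `mean_ratio`, `stddev_ratio`, and `z_score` values above are consistent with a running (expanding) mean and sample standard deviation over rows ordered by `date`. The pandas sketch below reproduces the first two rows under that assumption. The threshold of 3 and the `flag_anomalies` helper are hypothetical, since the output does not reveal the cutoff the project uses.

```python
import pandas as pd

Z_THRESHOLD = 3.0  # assumed cutoff; not stated in the notebook


def flag_anomalies(df: pd.DataFrame, threshold: float = Z_THRESHOLD) -> pd.DataFrame:
    """Compute an expanding-window z-score of demand_rrp_ratio and flag outliers."""
    out = df.sort_values("date").reset_index(drop=True)
    window = out["demand_rrp_ratio"].expanding()
    out["mean_ratio"] = window.mean()
    # Sample standard deviation (ddof=1); undefined for the first row.
    out["stddev_ratio"] = window.std()
    out["z_score"] = (out["demand_rrp_ratio"] - out["mean_ratio"]) / out["stddev_ratio"]
    out["is_anomaly"] = out["z_score"].abs() > threshold
    return out


# Ratios taken from the first two rows of the analytical output above.
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-01 00:30:00", "2020-01-01 01:00:00"]),
        "demand_rrp_ratio": [146.0718673218673, 136.46730083234246],
    }
)
flagged = flag_anomalies(sample)
print(flagged)
```

One difference from the Spark output: where the z-score is undefined, this sketch yields `False` for `is_anomaly` rather than `NULL`.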

# 5. Machine Learning and Forecasting Layer

- **Description:** This layer contains datasets specifically prepared for machine learning and forecasting models, including feature matrices and model predictions.
- **Processing:** Feature engineering, train-test split, and normalization.
- **Storage:** Store as Parquet files.
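
The split and normalization steps can be sketched as follows, assuming a chronological 80/20 split and z-score standardization fitted on the training rows only. Both choices, along with the `chronological_split` and `standardize` helpers, are illustrative assumptions. The sample values come from the curated output shown earlier.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Split by time order so the test set is strictly later than the training set."""
    ordered = df.sort_values("date").reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()


def standardize(train: pd.DataFrame, test: pd.DataFrame, columns: list):
    """Scale columns using statistics from the training set only."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0)
    train_scaled, test_scaled = train.copy(), test.copy()
    train_scaled[columns] = (train[columns] - mean) / std
    test_scaled[columns] = (test[columns] - mean) / std
    return train_scaled, test_scaled


features = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-01-01 07:00:00",
                "2020-01-01 10:00:00",
                "2020-01-04 03:00:00",
                "2020-01-07 07:30:00",
                "2020-01-09 01:30:00",
            ]
        ),
        "avg_demand": [6125.52, 6424.01, 6536.29, 7809.80, 6692.62],
        "avg_rrp": [32.66, 12.92, 45.37, 49.46, 37.50],
    }
)
train, test = chronological_split(features)
train_scaled, test_scaled = standardize(train, test, ["avg_demand", "avg_rrp"])
print(train_scaled)
```

Splitting by time rather than at random keeps future observations out of the training set, and fitting the scaler on the training rows alone avoids leaking test-set statistics into the features.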