# 1. Raw Data Layer

- **Description:** Store raw data as received from the source without any transformations. This serves as the immutable source of truth.
- **Storage:** CSV files in a directory hierarchy organised by region and year (for example `data/raw/NSW/2020/`).

In [25]:
from src.data_loader import DataLoader

file_paths = [
    "../data/raw/NSW/2020/*.csv",
    "../data/raw/NSW/2021/*.csv",
    "../data/raw/NSW/2022/*.csv",
    "../data/raw/NSW/2023/*.csv",
    "../data/raw/NSW/2024/*.csv"
]
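data_loader = DataLoader()
# NOTE: this call fails with the AttributeError shown below because DataLoader
# does not define read_csv_files. Either add that method to src/data_loader.py
# or call the CSV-reading method the class actually provides.
combined_df = data_loader.read_csv_files(file_paths)
combined_df.show(truncate=False, n=5)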

AttributeError: 'DataLoader' object has no attribute 'read_csv_files'
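
Since the cell above fails, here is a minimal sketch of reading the same kind of glob patterns without `DataLoader`. It uses plain pandas rather than Spark, and the helper name `read_csv_globs`, the temporary directory, and the sample rows are all hypothetical (the column names are borrowed from the staging output later in this notebook). It is not the project's loader, only an illustration of the idea.

```python
import glob
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_globs(patterns):
    """Expand each glob pattern and concatenate the matching CSV files."""
    paths = sorted(p for pattern in patterns for p in glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no CSV files matched: {patterns}")
    # dtype=str keeps every value exactly as written in the file,
    # so the raw layer is read back without any type coercion.
    frames = [pd.read_csv(p, dtype=str) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on throwaway files so the sketch runs anywhere (hypothetical data).
with tempfile.TemporaryDirectory() as root:
    for year, demand in [("2020", "7286.8"), ("2021", "7070.23")]:
        year_dir = Path(root) / "NSW" / year
        year_dir.mkdir(parents=True)
        (year_dir / "part.csv").write_text(
            "REGION,TOTALDEMAND,RRP,PERIODTYPE\n"
            f"NSW1,{demand},38.58,TRADE\n"
        )
    raw_df = read_csv_globs(
        [f"{root}/NSW/2020/*.csv", f"{root}/NSW/2021/*.csv"]
    )

print(raw_df)
```

Raising on an empty match is a deliberate choice here: a mistyped path then fails loudly instead of yielding an empty frame.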

# 2. Staging Data Layer

- **Description:** In this layer, data is cleaned, validated, and transformed into a consistent format. It serves as an intermediary stage before further processing.
- **Processing:** Use Apache Spark or pandas to read the raw CSV files, then clean, validate, and transform the data.
- **Storage:** Store the cleaned and transformed data in Parquet files for efficient querying and processing.

In [28]:
from src.data_loader import DataLoader

file_paths = [
    "../data/staging/NSW/2020/*.parquet",
    "../data/staging/NSW/2021/*.parquet",
    "../data/staging/NSW/2022/*.parquet",
    "../data/staging/NSW/2023/*.parquet",
    "../data/staging/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)
# combined_df.printSchema()

+------+-----------+-----+----------+----+
|REGION|TOTALDEMAND|RRP  |PERIODTYPE|date|
+------+-----------+-----+----------+----+
|NSW1  |7286.8     |38.58|TRADE     |NULL|
|NSW1  |7070.23    |38.44|TRADE     |NULL|
|NSW1  |6938.4     |37.96|TRADE     |NULL|
|NSW1  |6700.3     |37.84|TRADE     |NULL|
|NSW1  |6406.33    |35.4 |TRADE     |NULL|
+------+-----------+-----+----------+----+
only showing top 5 rows
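
Every `date` value in the output above is NULL, which suggests the date derivation did not succeed for these rows; the cause cannot be determined from the output alone. The sketch below shows one way the cleaning and validation step might be written in pandas so that such rows are caught before they reach staging. The `SETTLEMENTDATE` column, its timestamp format, the helper name `clean_raw`, and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw rows; SETTLEMENTDATE and its format are assumptions.
raw = pd.DataFrame(
    {
        "REGION": ["NSW1", "NSW1", "NSW1", "NSW1"],
        "SETTLEMENTDATE": [
            "2020/01/01 00:30:00",
            "2020/01/01 01:00:00",
            "not a date",
            "2020/01/01 02:00:00",
        ],
        "TOTALDEMAND": ["7286.8", "7070.23", "6938.4", "6700.3"],
        "RRP": ["38.58", "38.44", "37.96", "bad"],
        "PERIODTYPE": ["TRADE", "TRADE", "TRADE", "TRADE"],
    }
)


def clean_raw(df):
    """Cast columns to their intended types and drop rows that fail to parse."""
    out = df.copy()
    # errors="coerce" turns unparseable values into NaN/NaT instead of raising.
    out["TOTALDEMAND"] = pd.to_numeric(out["TOTALDEMAND"], errors="coerce")
    out["RRP"] = pd.to_numeric(out["RRP"], errors="coerce")
    out["timestamp"] = pd.to_datetime(
        out["SETTLEMENTDATE"], format="%Y/%m/%d %H:%M:%S", errors="coerce"
    )
    out = out.dropna(subset=["TOTALDEMAND", "RRP", "timestamp"])
    out["date"] = out["timestamp"].dt.date
    columns = ["REGION", "TOTALDEMAND", "RRP", "PERIODTYPE", "date"]
    return out[columns].reset_index(drop=True)


staged = clean_raw(raw)
rejected = len(raw) - len(staged)
print(staged)
print(f"rejected {rejected} of {len(raw)} rows")

# Writing the result would be staged.to_parquet(path), which needs
# pyarrow or fastparquet installed; it is left out to keep the sketch light.
```

Reporting the rejected row count makes a parsing problem visible at this layer rather than as NULLs further downstream.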



# 3. Curated Data Layer

- **Description:** This layer stores data that has been further processed and enriched, optimized for consumption by analytics and machine learning models.
- **Processing:** Perform aggregation, normalization, and feature engineering.
- **Storage:** Store the curated data in Parquet files.

In [29]:
from src.data_loader import DataLoader

file_paths = [
    "../data/curated/NSW/2020/*.parquet",
    "../data/curated/NSW/2021/*.parquet",
    "../data/curated/NSW/2022/*.parquet",
    "../data/curated/NSW/2023/*.parquet",
    "../data/curated/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+----+-----------------+------------------+--------------------+------------------+
|date|avg_demand       |avg_rrp           |total_demand        |total_rrp         |
+----+-----------------+------------------+--------------------+------------------+
|NULL|8264.217157258077|152.30456317204286|1.229715513000002E7 |226629.18999999977|
|NULL|7945.958994252869|57.50137212643676 |1.1060774919999994E7|80041.90999999997 |
|NULL|7342.016155913984|46.16456989247321 |1.0924920040000008E7|68692.88000000014 |
|NULL|7791.773219086021|41.94977822580641 |1.1594158549999999E7|62421.269999999946|
|NULL|8690.303077956987|47.98088709677423 |1.2931170979999997E7|71395.56000000006 |
+----+-----------------+------------------+--------------------+------------------+
only showing top 5 rows
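
The columns shown above (`avg_demand`, `avg_rrp`, `total_demand`, `total_rrp`) look like per-group means and sums of the staging columns. A minimal pandas sketch of such an aggregation follows; the grouping key, its granularity, and the sample values are assumptions, and the actual curation code may differ.

```python
import pandas as pd

# Hypothetical staged rows; the real grouping granularity is not known.
staged = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
        "TOTALDEMAND": [7000.0, 8000.0, 6000.0, 6500.0],
        "RRP": [40.0, 50.0, 30.0, 35.0],
    }
)

# Named aggregation: output column = (input column, aggregation function).
curated = staged.groupby("date", as_index=False).agg(
    avg_demand=("TOTALDEMAND", "mean"),
    avg_rrp=("RRP", "mean"),
    total_demand=("TOTALDEMAND", "sum"),
    total_rrp=("RRP", "sum"),
)

print(curated)
```

Rows whose grouping key is null are dropped by `groupby` by default, so a null `date` in staging would need to be resolved before this step.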



# 4. Analytical Data Layer

- **Description:** This layer is optimized for direct querying and consumption by BI tools and dashboards.
- **Processing:** Create views, precomputed aggregates, and indices for fast query performance.
- **Storage:** Use Parquet files for efficient querying and analytics.
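
To make the three processing ideas concrete (an index, a precomputed aggregate, and a view), here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in engine chosen so the example runs without extra dependencies; it is not part of the pipeline described here, and the table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE curated (date TEXT, avg_demand REAL, avg_rrp REAL)")
conn.executemany(
    "INSERT INTO curated VALUES (?, ?, ?)",
    [
        ("2020-01-01", 8000.0, 150.0),
        ("2020-02-01", 7000.0, 50.0),
        ("2021-01-01", 9000.0, 40.0),
    ],
)

# Index: speeds up lookups and range filters on date.
conn.execute("CREATE INDEX idx_curated_date ON curated (date)")

# Precomputed aggregate: computed once and stored as its own table.
conn.execute(
    """
    CREATE TABLE yearly_summary AS
    SELECT substr(date, 1, 4) AS year,
           AVG(avg_demand)    AS avg_demand,
           AVG(avg_rrp)       AS avg_rrp
    FROM curated
    GROUP BY substr(date, 1, 4)
    """
)

# View: a saved query that is re-evaluated each time it is read.
conn.execute(
    "CREATE VIEW high_price_periods AS "
    "SELECT date, avg_rrp FROM curated WHERE avg_rrp > 100"
)

yearly = conn.execute(
    "SELECT year, avg_demand, avg_rrp FROM yearly_summary ORDER BY year"
).fetchall()
high_price = conn.execute("SELECT date, avg_rrp FROM high_price_periods").fetchall()
print(yearly)
print(high_price)
```

The trade-off is the usual one: the precomputed table is fast to read but must be rebuilt when the curated data changes, while the view is always current but pays the query cost on every read.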

# 5. Machine Learning and Forecasting Layer

- **Description:** This layer contains datasets specifically prepared for machine learning and forecasting models, including feature matrices and model predictions.
- **Processing:** Feature engineering, train-test split, and normalization.
- **Storage:** Store as Parquet files.
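
The processing steps listed above could be sketched as follows for a time series. This is an illustration under stated assumptions: the lag feature, the 80/20 split ratio, the z-score normalization, and the sample data are all hypothetical choices, not the project's actual configuration.

```python
import pandas as pd

# Hypothetical curated series: one row per day.
curated = pd.DataFrame(
    {
        "date": pd.date_range("2020-01-01", periods=10, freq="D"),
        "avg_demand": [7000.0 + 100.0 * i for i in range(10)],
    }
)

# Feature engineering: previous period's demand and day of week.
features = curated.sort_values("date").reset_index(drop=True)
features["demand_lag_1"] = features["avg_demand"].shift(1)
features["day_of_week"] = features["date"].dt.dayofweek
features = features.dropna().reset_index(drop=True)  # first row has no lag

# Chronological split: the test set is strictly later than the training set,
# which avoids leaking future values into training.
split = int(len(features) * 0.8)
train_df = features.iloc[:split].copy()
test_df = features.iloc[split:].copy()

# Normalization: statistics come from the training rows only and are
# then applied unchanged to the test rows.
lag_mean = train_df["demand_lag_1"].mean()
lag_std = train_df["demand_lag_1"].std()
train_df["demand_lag_1_norm"] = (train_df["demand_lag_1"] - lag_mean) / lag_std
test_df["demand_lag_1_norm"] = (test_df["demand_lag_1"] - lag_mean) / lag_std

print(train_df)
print(test_df)
```

A random shuffle split is avoided here because, for forecasting, it would let the model train on observations that come after the ones it is evaluated on.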