# 1. Raw Data Layer

- **Description:** Store raw data as received from the source without any transformations. This serves as the immutable source of truth.
- **Storage:** CSV files in a structured directory hierarchy, one folder per region and year (for example `../data/raw/NSW/2020/`).
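
The year-by-year path lists in this notebook all follow the same pattern, so they can be generated rather than typed out. The following is a minimal sketch, assuming the `../data/<layer>/<region>/<year>/` layout seen in the cells; `build_layer_paths` is a hypothetical helper, not part of `src.data_loader`.

```python
from typing import Iterable, List


def build_layer_paths(layer: str, region: str, years: Iterable[int],
                      extension: str, root: str = "../data") -> List[str]:
    """Return one glob pattern per year for the given layer and region."""
    return [f"{root}/{layer}/{region}/{year}/*.{extension}" for year in years]


# Glob patterns for the raw NSW files, 2020 through 2024 inclusive.
raw_paths = build_layer_paths("raw", "NSW", range(2020, 2025), "csv")
```

The same call with `layer="staging"` and `extension="parquet"` would produce the list used for the next layer.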

In [25]:
from src.data_loader import DataLoader

file_paths = [
    "../data/raw/NSW/2020/*.csv",
    "../data/raw/NSW/2021/*.csv",
    "../data/raw/NSW/2022/*.csv",
    "../data/raw/NSW/2023/*.csv",
    "../data/raw/NSW/2024/*.csv"
]
data_loader = DataLoader()
combined_df = data_loader.read_csv_files(file_paths)
combined_df.show(truncate=False, n=5)

AttributeError: 'DataLoader' object has no attribute 'read_csv_files'

# 2. Staging Data Layer

- **Description:** In this layer, data is cleaned, validated, and transformed into a consistent format. It serves as an intermediary stage before further processing.
- **Processing:** Use Apache Spark or pandas to read the raw CSV files, then clean, validate, and transform the data.
- **Storage:** Store the cleaned and transformed data in Parquet files for efficient querying and processing.
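
The cleaning and validation step can be sketched in plain Python, independent of Spark or pandas. This is an illustration only: the column names come from the staging output in this notebook, while the raw timestamp format, the `clean_row` helper, and the sample rows are assumptions.

```python
from datetime import datetime
from typing import Dict, Optional

# Assumed raw timestamp format; adjust to match the actual source files.
RAW_DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def clean_row(row: Dict[str, str]) -> Optional[dict]:
    """Validate one raw record and cast its fields to typed values.

    Returns None when a required field is missing or malformed,
    so the caller can drop the row.
    """
    try:
        return {
            "REGION": row["REGION"].strip(),
            "TOTALDEMAND": float(row["TOTALDEMAND"]),
            "RRP": float(row["RRP"]),
            "PERIODTYPE": row["PERIODTYPE"].strip().upper(),
            "date": datetime.strptime(row["date"].strip(), RAW_DATE_FORMAT),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None


# Illustrative input: one valid row and one with an unparseable demand value.
raw_rows = [
    {"REGION": " NSW1", "TOTALDEMAND": "7134.15", "RRP": "48.84",
     "PERIODTYPE": "trade", "date": "2020/01/01 00:30:00"},
    {"REGION": "NSW1", "TOTALDEMAND": "n/a", "RRP": "50.46",
     "PERIODTYPE": "TRADE", "date": "2020/01/01 01:00:00"},
]
staged_rows = [r for r in map(clean_row, raw_rows) if r is not None]
```

Dropping invalid rows is only one possible policy; a real pipeline might instead route them to a quarantine file for inspection.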

In [37]:
from src.data_loader import DataLoader

file_paths = [
    "../data/staging/NSW/2020/*.parquet",
    "../data/staging/NSW/2021/*.parquet",
    "../data/staging/NSW/2022/*.parquet",
    "../data/staging/NSW/2023/*.parquet",
    "../data/staging/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)
# combined_df.printSchema()

+------+-----------+-----+----------+-------------------+
|REGION|TOTALDEMAND|RRP  |PERIODTYPE|date               |
+------+-----------+-----+----------+-------------------+
|NSW1  |7134.15    |48.84|TRADE     |2020-01-01 00:30:00|
|NSW1  |6886.14    |50.46|TRADE     |2020-01-01 01:00:00|
|NSW1  |6682.01    |48.73|TRADE     |2020-01-01 01:30:00|
|NSW1  |6452.46    |48.92|TRADE     |2020-01-01 02:00:00|
|NSW1  |6286.89    |49.49|TRADE     |2020-01-01 02:30:00|
+------+-----------+-----+----------+-------------------+
only showing top 5 rows



# 3. Curated Data Layer

- **Description:** This layer stores data that has been further processed and enriched, optimized for consumption by analytics and machine learning models.
- **Processing:** Perform aggregation, normalization, and feature engineering.
- **Storage:** Store the curated data in Parquet files.
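
The aggregation step can be illustrated with a small standard-library sketch. It assumes staged rows are grouped by hour; the actual grouping key used by this project is not shown in the notebook, and `aggregate_by_hour` is a hypothetical name. The output columns mirror those displayed for this layer.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_hour(rows):
    """Group staged rows by hour and compute average and total demand and RRP."""
    groups = defaultdict(list)
    for row in rows:
        # Truncate the timestamp to the start of its hour (assumed granularity).
        hour = row["date"].replace(minute=0, second=0, microsecond=0)
        groups[hour].append(row)

    curated = []
    for hour in sorted(groups):
        demand = [r["TOTALDEMAND"] for r in groups[hour]]
        rrp = [r["RRP"] for r in groups[hour]]
        curated.append({
            "date": hour,
            "avg_demand": mean(demand),
            "avg_rrp": mean(rrp),
            "total_demand": sum(demand),
            "total_rrp": sum(rrp),
        })
    return curated


# Three half-hourly readings taken from the staging output above.
staged = [
    {"date": datetime(2020, 1, 1, 1, 0), "TOTALDEMAND": 6886.14, "RRP": 50.46},
    {"date": datetime(2020, 1, 1, 1, 30), "TOTALDEMAND": 6682.01, "RRP": 48.73},
    {"date": datetime(2020, 1, 1, 2, 0), "TOTALDEMAND": 6452.46, "RRP": 48.92},
]
curated = aggregate_by_hour(staged)
```

Note that when every group contains a single row, the average and total columns are identical.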

In [38]:
from src.data_loader import DataLoader

file_paths = [
    "../data/curated/NSW/2020/*.parquet",
    "../data/curated/NSW/2021/*.parquet",
    "../data/curated/NSW/2022/*.parquet",
    "../data/curated/NSW/2023/*.parquet",
    "../data/curated/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+-------------------+----------+-------+------------+---------+
|date               |avg_demand|avg_rrp|total_demand|total_rrp|
+-------------------+----------+-------+------------+---------+
|2020-01-01 07:00:00|6125.52   |32.66  |6125.52     |32.66    |
|2020-01-01 10:00:00|6424.01   |12.92  |6424.01     |12.92    |
|2020-01-04 03:00:00|6536.29   |45.37  |6536.29     |45.37    |
|2020-01-07 07:30:00|7809.8    |49.46  |7809.8      |49.46    |
|2020-01-09 01:30:00|6692.62   |37.5   |6692.62     |37.5     |
+-------------------+----------+-------+------------+---------+
only showing top 5 rows



# 4. Analytical Data Layer

- **Description:** This layer is optimized for direct querying and consumption by BI tools and dashboards.
- **Processing:** Create views, precomputed aggregates, and indices for fast query performance.
- **Storage:** Use Parquet files for efficient querying and analytics.
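
A precomputed aggregate can be as simple as one summary row per calendar month. The sketch below is hypothetical: it assumes the input rows carry the curated layer's `total_demand` and `total_rrp` columns, and it defines the monthly average as the mean of those per-row totals, which may differ from how this project computes it. The February row is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def monthly_rollup(curated_rows):
    """Precompute one summary row per calendar month from curated rows."""
    totals = defaultdict(lambda: {"demand": 0.0, "rrp": 0.0, "count": 0})
    for row in curated_rows:
        key = (row["date"].year, row["date"].month)
        totals[key]["demand"] += row["total_demand"]
        totals[key]["rrp"] += row["total_rrp"]
        totals[key]["count"] += 1

    return [
        {
            "date": datetime(year, month, 1),
            # Average of the per-row totals within the month (an assumption).
            "monthly_avg_demand": acc["demand"] / acc["count"],
            "monthly_avg_rrp": acc["rrp"] / acc["count"],
            "monthly_total_demand": acc["demand"],
            "monthly_total_rrp": acc["rrp"],
        }
        for (year, month), acc in sorted(totals.items())
    ]


curated_rows = [
    {"date": datetime(2020, 1, 1, 7, 0), "total_demand": 6125.52, "total_rrp": 32.66},
    {"date": datetime(2020, 1, 1, 10, 0), "total_demand": 6424.01, "total_rrp": 12.92},
    {"date": datetime(2020, 2, 1, 0, 0), "total_demand": 6000.0, "total_rrp": 40.0},
]
monthly = monthly_rollup(curated_rows)
```

Because the roll-up is computed once and stored, dashboards can read the monthly rows directly instead of re-aggregating the curated data on every query.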

In [39]:
from src.data_loader import DataLoader

file_paths = [
    "../data/analytical/NSW/2020/*.parquet",
    "../data/analytical/NSW/2021/*.parquet",
    "../data/analytical/NSW/2022/*.parquet",
    "../data/analytical/NSW/2023/*.parquet",
    "../data/analytical/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+-------------------+------------------+---------------+--------------------+-----------------+
|date               |monthly_avg_demand|monthly_avg_rrp|monthly_total_demand|monthly_total_rrp|
+-------------------+------------------+---------------+--------------------+-----------------+
|2020-01-01 07:00:00|6125.52           |32.66          |6125.52             |32.66            |
|2020-01-01 10:00:00|6424.01           |12.92          |6424.01             |12.92            |
|2020-01-04 03:00:00|6536.29           |45.37          |6536.29             |45.37            |
|2020-01-07 07:30:00|7809.8            |49.46          |7809.8              |49.46            |
|2020-01-09 01:30:00|6692.62           |37.5           |6692.62             |37.5             |
+-------------------+------------------+---------------+--------------------+-----------------+
only showing top 5 rows



# 5. Machine Learning and Forecasting Layer

- **Description:** This layer contains datasets specifically prepared for machine learning and forecasting models, including feature matrices and model predictions.
- **Processing:** Perform feature engineering, a train-test split, and normalization.
- **Storage:** Store as Parquet files.
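
The split and normalization steps can be sketched as follows. This assumes a chronological split, which is a common choice for forecasting but is not confirmed by the notebook, and min-max scaling fitted on the training split only; both helper names are hypothetical.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows without shuffling; the test set is the most recent data."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_min_max(values):
    """Return a function that maps the fitted range onto [0, 1]."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns

    def scale(value):
        return (value - low) / span

    return scale


# Demand values from the staging output, in time order.
demand = [7134.15, 6886.14, 6682.01, 6452.46, 6286.89]
train, test = chronological_split(demand, train_fraction=0.8)

scale = fit_min_max(train)  # fit on the training split only
train_scaled = [scale(v) for v in train]
test_scaled = [scale(v) for v in test]
```

Fitting the scaler on the training split alone keeps information from the test period out of the features. As a consequence, scaled test values can fall outside the range 0 to 1, as the last value does here.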

In [41]:
from src.data_loader import DataLoader

file_paths = [
    # "../data/ml_forecasting/NSW/2020/*.parquet",
    # "../data/ml_forecasting/NSW/2021/*.parquet",
    # "../data/ml_forecasting/NSW/2022/*.parquet",
    "../data/ml_forecasting/NSW/2023/*.parquet",
    # "../data/ml_forecasting/NSW/2024/*.parquet"
]
data_loader = DataLoader()
combined_df = data_loader.read_parquet_files(file_paths)
combined_df.show(truncate=False, n=5)

+------+-------+-----------------+-----------------+-------------------+-----------------+-----------------+---------------------------------------------------------------------------------------------+-----------------+
|REGION|quarter|avg_demand       |label            |total_demand       |total_rrp        |demand_rrp_ratio |features                                                                                     |prediction       |
+------+-------+-----------------+-----------------+-------------------+-----------------+-----------------+---------------------------------------------------------------------------------------------+-----------------+
|NSW   |1      |7643.787197420617|90.12909598214287|6.163949995999986E7|726801.0300000001|84.80931839075662|[7643.787197420617,90.12909598214287,6.163949995999986E7,726801.0300000001,84.80931839075662]|90.12909598214287|
+------+-------+-----------------+-----------------+-------------------+-----------------+-----------------+--------