## 1. Research & Data Sourcing

### 1.1 Literature Review Summary
- **Load Forecasting in Smart Cities:** Recent studies emphasize the importance of integrating multi-modal data (energy, weather, socio-economic) for accurate load forecasting, especially in greenfield contexts where historical data is limited.
- **Hybrid Models:** Literature supports combining statistical (ARIMA, SARIMA) and machine learning models (LSTM, XGBoost) for improved accuracy.
- **Data Scarcity Solutions:** Simulation (e.g., Monte Carlo for smart meter data) and transfer learning are common approaches to address data scarcity in new urban developments.

**References:**
- Hong, T., Pinson, P., & Fan, S. (2014). Global Energy Forecasting Competition 2012. International Journal of Forecasting.
- Kong, W., Dong, Z. Y., Jia, Y., Hill, D. J., Xu, Y., & Zhang, Y. (2017). Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Transactions on Smart Grid.

### 1.2 Data Source Inventory
- **CEA API:** Real-time and historical electricity consumption data for India (https://cea.nic.in/)
- **Simulated GIFT City Smart Meter Data:** Generated using Monte Carlo methods to mimic smart meter readings in a greenfield city.
- **OpenWeatherMap API:** Hourly/daily meteorological data (temperature, humidity, etc.) for GIFT City region (https://openweathermap.org/api)
- **Other Sources:** Socio-economic indicators, if available.
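
As a rough illustration of the OpenWeatherMap entry above, the sketch below builds a request for the service's current-weather endpoint and keeps only the fields relevant to load forecasting. The coordinates are an approximate, assumed location for GIFT City, the helper names (`build_weather_url`, `parse_weather`, `fetch_current_weather`) are hypothetical, and the standard library is used instead of `requests` only to keep the example self-contained. Hourly or historical data would need a different endpoint and possibly a paid plan.

```python
import json
import urllib.parse
import urllib.request

# OpenWeatherMap "current weather" endpoint.
OWM_URL = "https://api.openweathermap.org/data/2.5/weather"

# Approximate coordinates for GIFT City (an assumption; verify before use).
GIFT_CITY_LAT, GIFT_CITY_LON = 23.16, 72.68


def build_weather_url(lat, lon, api_key, units="metric"):
    """Build the request URL for one location."""
    query = urllib.parse.urlencode(
        {"lat": lat, "lon": lon, "appid": api_key, "units": units}
    )
    return f"{OWM_URL}?{query}"


def parse_weather(payload):
    """Keep only the fields the forecasting features need."""
    return {
        "timestamp_utc": payload["dt"],           # Unix seconds
        "temperature": payload["main"]["temp"],   # deg C when units="metric"
        "humidity": payload["main"]["humidity"],  # percent
    }


def fetch_current_weather(lat, lon, api_key):
    """Call the API and return the parsed record (needs network and a valid key)."""
    url = build_weather_url(lat, lon, api_key)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_weather(json.load(resp))
```

With a valid key, `fetch_current_weather(GIFT_CITY_LAT, GIFT_CITY_LON, api_key)` would return one record; keeping parsing separate from the network call makes the parsing logic testable offline.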

---

In [None]:
# 1.3 Data Acquisition: Load Provided Power Data CSVs
# Load both data files from the data folder and preview their structure

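import os

import pandas as pd
from IPython.display import display  # implicit in Jupyter; imported to make the dependency explicit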

data_folder = 'data'
file1 = 'power-supply-position-pe.csv'
file2 = 'Daily_Power_Gen_Source_march_23.csv'

path1 = os.path.join(data_folder, file1)
path2 = os.path.join(data_folder, file2)

try:
    df_power_supply = pd.read_csv(path1)
    print(f'{file1} loaded successfully:')
    display(df_power_supply.head())
    df_power_supply.info()  # info() prints its own report and returns None
except Exception as e:
    print(f'Error loading {file1}: {e}')

try:
    df_power_gen = pd.read_csv(path2)
    print(f'{file2} loaded successfully:')
    display(df_power_gen.head())
    df_power_gen.info()  # info() prints its own report and returns None
except Exception as e:
    print(f'Error loading {file2}: {e}')


In [None]:
# 1.4 Data Assessment: Explore Structure and Content of Both Data Files

print('--- Power Supply Position Data ---')
print('Columns:', df_power_supply.columns.tolist())
print(df_power_supply.describe(include='all'))
print(df_power_supply.nunique())
print(df_power_supply.isnull().sum())

print('\n--- Daily Power Generation Source Data ---')
print('Columns:', df_power_gen.columns.tolist())
print(df_power_gen.describe(include='all'))
print(df_power_gen.nunique())
print(df_power_gen.isnull().sum())


# Project Introduction: Load Forecasting for GIFT City

This notebook documents the foundational data engineering and infrastructure steps for the hybrid load forecasting model, as part of the GIFT City Capstone project.

**Objective:**
- Build a modular, reproducible data pipeline that sources, integrates, and preprocesses multi-modal datasets and engineers features from them, enabling accurate and robust load forecasting in a greenfield smart city context.

**Scope of this Notebook:**
- Research and document data sources (energy, weather, simulated smart meter data).
- Integrate and preprocess all datasets.
- Engineer domain-specific features for downstream ML models.
- Develop a modular ETL pipeline.
- Validate data and features with EDA and visualizations.
- Document all steps for team handoff and reproducibility.


# Load Forecasting Data Infrastructure Notebook

This notebook documents the foundational data engineering steps for the hybrid load forecasting model (GIFT City Capstone). It covers data sourcing, integration, preprocessing, feature engineering, pipeline development, and validation.

## 1. Research & Data Sourcing
- Review literature and identify optimal datasets.
- Fetch/simulate data from CEA API, GIFT City smart meters (Monte Carlo), and OpenWeatherMap API.
- Document all sources and scripts.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
# Add more as needed (e.g., sklearn, datetime)

### 1.1 Literature Review & Data Source Inventory
- Summarize key findings from literature.
- List and describe all data sources.

### 1.2 Data Acquisition Scripts
- CEA API: [Insert code to fetch data]
- GIFT City Smart Meter Simulation: [Insert Monte Carlo code]
- OpenWeatherMap API: [Insert code to fetch weather data]
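
One way the Monte Carlo placeholder above might be filled in is sketched below. Every distributional choice here is an assumption made for illustration (log-normal per-meter base load, a sinusoidal daily shape, 10% multiplicative noise), and the function name `simulate_smart_meters` is hypothetical; real GIFT City consumption patterns would need to be calibrated against measured data.

```python
import numpy as np
import pandas as pd


def simulate_smart_meters(n_meters=50, days=7, seed=42):
    """Monte Carlo draw of hourly readings (kWh) for n_meters over `days` days."""
    rng = np.random.default_rng(seed)
    index = pd.date_range("2023-03-01", periods=days * 24, freq=pd.Timedelta(hours=1))
    hours = index.hour.to_numpy()

    # Assumed daily shape: lowest around 03:00, peak around 15:00 (illustrative only).
    daily_shape = 0.6 + 0.4 * np.sin((hours - 9) / 24 * 2 * np.pi)

    # Each meter gets a random base load; each reading gets multiplicative noise.
    base_load = rng.lognormal(mean=0.0, sigma=0.5, size=n_meters)
    noise = rng.normal(loc=1.0, scale=0.1, size=(len(index), n_meters))
    readings = np.clip(daily_shape[:, None] * base_load[None, :] * noise, 0, None)

    columns = [f"meter_{i:03d}" for i in range(n_meters)]
    return pd.DataFrame(readings, index=index, columns=columns)


meters = simulate_smart_meters()
total_load = meters.sum(axis=1)  # aggregate load across all simulated meters
print(total_load.describe())
```

Fixing the seed keeps the simulated dataset reproducible across runs, which matters when the same synthetic readings feed several downstream experiments.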

## 2. Data Integration
- Load and merge all datasets.
- Ensure time alignment and consistency.
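
Time alignment is often the trickiest part of the merge, because sources rarely share a sampling rate. The sketch below uses synthetic values and hypothetical column names (`timestamp`, `load_mw`, `temp_c`) to show one possible approach with `pandas.merge_asof`: each load row receives the most recent weather reading taken at or before it, within a tolerance.

```python
import pandas as pd

hourly = pd.date_range("2023-03-01", periods=6, freq=pd.Timedelta(hours=1))
load = pd.DataFrame(
    {"timestamp": hourly, "load_mw": [100.0, 98.0, 97.0, 101.0, 110.0, 120.0]}
)

# Weather arrives every 3 hours and 5 minutes past the hour (synthetic values).
weather = pd.DataFrame(
    {"timestamp": hourly[[0, 3]] + pd.Timedelta(minutes=5), "temp_c": [24.0, 22.5]}
)

# merge_asof needs both frames sorted on the key. direction="backward" attaches
# the latest weather reading at or before each load timestamp; readings older
# than the tolerance are ignored and leave NaN.
merged = pd.merge_asof(
    load.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta(hours=3),
)
print(merged)
```

The first load row has no earlier weather reading and therefore stays `NaN`, which makes alignment gaps visible instead of silently filling them.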

## 3. Data Preprocessing
- Outlier handling (IQR method).
- Missing value imputation (forward-fill, etc.).
- Feature normalization/scaling.
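
The three preprocessing steps above can be sketched on a toy series as follows. The helper names (`clip_outliers_iqr`, `min_max_scale`) are hypothetical, clipping to the IQR fences is only one of several ways to handle outliers, and min-max scaling is an assumed choice of normalization.

```python
import numpy as np
import pandas as pd


def clip_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def min_max_scale(series):
    """Rescale to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


raw = pd.Series([100.0, 102.0, np.nan, 98.0, 101.0, 500.0, 99.0, np.nan, 103.0])

clipped = clip_outliers_iqr(raw)  # quantiles skip NaN, so gaps do not distort the fences
filled = clipped.ffill()          # forward-fill gaps from the last valid reading
scaled = min_max_scale(filled)
print(pd.DataFrame({"raw": raw, "clipped": clipped, "filled": filled, "scaled": scaled}))
```

Clipping before imputation keeps a spike from being copied into neighbouring gaps by the forward-fill. Note that a gap at the very start of a series has no earlier value to copy and would remain `NaN`.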

## 4. Feature Engineering
- Temporal lags, interaction terms, derived metrics.
- Feature selection (correlation, PCA).
- Visualizations (correlation heatmaps, feature importance).
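
A minimal sketch of the lag, interaction, and correlation-screening ideas above, run on synthetic data. The column names and the specific lags (1 hour, 24 hours) are assumptions for illustration; PCA and feature-importance plots are left out to keep the example short.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 72
idx = pd.date_range("2023-03-01", periods=n, freq=pd.Timedelta(hours=1))
cycle = np.sin(np.arange(n) / 24 * 2 * np.pi)  # one cycle per day
df = pd.DataFrame(
    {
        "load_mw": 100 + 10 * cycle + rng.normal(0, 1, n),
        "temp_c": 25 + 5 * cycle,
        "humidity": rng.uniform(30, 60, n),
    },
    index=idx,
)

# Temporal lags: the load one hour and one day earlier.
df["load_lag_1h"] = df["load_mw"].shift(1)
df["load_lag_24h"] = df["load_mw"].shift(24)

# Derived metric: trailing 24-hour mean, shifted so it excludes the current hour.
df["load_roll_mean_24h"] = df["load_mw"].shift(1).rolling(24).mean()

# Interaction term and calendar feature.
df["temp_x_humidity"] = df["temp_c"] * df["humidity"]
df["hour"] = df.index.hour

features = df.dropna()  # lagging leaves NaNs in the first 24 rows

# Correlation-based screening: rank candidate features against the target.
corr_with_target = (
    features.corr()["load_mw"].drop("load_mw").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

Shifting before taking the rolling mean avoids leaking the current hour's load into its own predictor, a common pitfall when building lag-based features.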

## 5. ETL Pipeline Development
- Modularize steps into functions/classes.
- Build and test ETL pipeline.
- Add unit tests.
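
The modular structure described above could look roughly like the sketch below: one function per ETL stage, a thin runner, and a unit test for the transform step. The function names and the `timestamp` / `load_mw` columns are hypothetical, and in a real project the test would live in a separate file run by a test runner such as pytest.

```python
import io

import pandas as pd


def extract(source):
    """Read raw rows; `source` is a path or file-like object."""
    return pd.read_csv(source, parse_dates=["timestamp"])


def transform(df):
    """Drop duplicate timestamps, sort by time, forward-fill gaps, add a lag."""
    out = (
        df.drop_duplicates(subset="timestamp", keep="last")
        .sort_values("timestamp")
        .set_index("timestamp")
    )
    out["load_mw"] = out["load_mw"].ffill()
    out["load_lag_1"] = out["load_mw"].shift(1)
    return out


def load(df, target):
    """Write the processed frame as CSV to a path or buffer."""
    df.to_csv(target)


def run_pipeline(source, target):
    load(transform(extract(source)), target)


def test_transform_fills_gaps_and_sorts():
    raw = pd.DataFrame(
        {
            "timestamp": pd.to_datetime(
                ["2023-03-01 01:00", "2023-03-01 00:00", "2023-03-01 02:00"]
            ),
            "load_mw": [None, 100.0, 105.0],
        }
    )
    result = transform(raw)
    assert result.index.is_monotonic_increasing
    assert result["load_mw"].tolist() == [100.0, 100.0, 105.0]


test_transform_fills_gaps_and_sorts()

# Usage with in-memory buffers standing in for real files.
src = io.StringIO("timestamp,load_mw\n2023-03-01 00:00,100\n2023-03-01 01:00,\n")
dst = io.StringIO()
run_pipeline(src, dst)
print(dst.getvalue())
```

Keeping each stage a plain function that takes and returns a DataFrame makes the stages testable in isolation, without touching the file system or any API.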

## 6. Validation & Documentation
- EDA and data quality checks.
- Save visualizations and summary reports.
- Document all steps for team handoff.
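
As one possible shape for the data quality checks above, the sketch below summarizes a time-indexed frame: duplicate and missing timestamps, null counts, and negative readings. The function name `data_quality_report`, the hourly frequency, and the `load_mw` column are assumptions for illustration; which checks matter depends on the actual datasets.

```python
import pandas as pd


def data_quality_report(df, freq=pd.Timedelta(hours=1)):
    """Summarize common issues in a frame indexed by timestamp."""
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return {
        "rows": len(df),
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "missing_timestamps": int(len(expected.difference(df.index))),
        "null_counts": df.isnull().sum().to_dict(),
        "negative_values": int((df.select_dtypes("number") < 0).sum().sum()),
    }


# Synthetic sample: 01:00 appears twice, 02:00 is absent, one null, one negative.
base = pd.date_range("2023-03-01", periods=4, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({"load_mw": [100.0, None, 101.0, -5.0]}, index=base[[0, 1, 1, 3]])

report = data_quality_report(sample)
print(report)
```

Returning a plain dictionary makes the report easy to log, compare between pipeline runs, or save alongside the summary outputs.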

## 7. Deliverables
- Unified, preprocessed dataset (CSV/Parquet).
- Feature engineering report (notebook with code and plots).
- Automated ETL pipeline scripts with tests.
- Interim validation metrics and EDA outputs.