## 1. Research & Data Sourcing

### 1.1 Literature Review Summary
- **Load Forecasting in Smart Cities:** Recent studies emphasize the importance of integrating multi-modal data (energy, weather, socio-economic) for accurate load forecasting, especially in greenfield contexts where historical data is limited.
- **Hybrid Models:** The literature supports combining statistical models (ARIMA, SARIMA) with machine learning models (LSTM, XGBoost) for improved accuracy.
- **Data Scarcity Solutions:** Simulation (e.g., Monte Carlo for smart meter data) and transfer learning are common approaches to address data scarcity in new urban developments.

**References:**
- Hong, T., Pinson, P., & Fan, S. (2014). Global Energy Forecasting Competition 2012. International Journal of Forecasting.
- Kong, W., Dong, Z. Y., Jia, Y., Hill, D. J., Xu, Y., & Zhang, Y. (2017). Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Transactions on Smart Grid.
- [Add more as needed]

### 1.2 Data Source Inventory
- **CEA API:** Real-time and historical electricity consumption data for India from the Central Electricity Authority (https://cea.nic.in/)
- **Simulated GIFT City Smart Meter Data:** Generated using Monte Carlo methods to mimic smart meter readings in a greenfield city.
- **OpenWeatherMap API:** Hourly/daily meteorological data (temperature, humidity, etc.) for GIFT City region (https://openweathermap.org/api)
- **Other Sources:** Socio-economic indicators, if available.

---

In [None]:
# 1.3 Data Acquisition: CEA API (Placeholder)
# Note: CEA API may require registration or manual download. Replace URL and params as needed.
import pandas as pd
import requests

# Example placeholder for CEA data download
cea_url = 'https://cea.nic.in/api/endpoint'  # Replace with actual endpoint
params = {'state': 'Gujarat', 'from': '2024-01-01', 'to': '2024-12-31'}
# response = requests.get(cea_url, params=params)
# cea_data = pd.DataFrame(response.json())
# cea_data.to_csv('cea_gujarat_2024.csv', index=False)
print('CEA data acquisition: Placeholder - update with actual API details.')

# 1.4 Simulate GIFT City Smart Meter Data (Monte Carlo)
import numpy as np
import pandas as pd

np.random.seed(42)
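# Cover the full year through the last hour of 2024-12-31 (end='2024-12-31' alone stops at midnight).
# The lowercase 'h' alias replaces 'H', which is deprecated in pandas >= 2.2.
dates = pd.date_range(start='2024-01-01', end='2024-12-31 23:00', freq='h')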
n_households = 1000
# Simulate hourly load (kWh) for each household
simulated_data = pd.DataFrame({
    'timestamp': np.tile(dates, n_households),
    'household_id': np.repeat(np.arange(n_households), len(dates)),
    'load_kwh': np.random.normal(loc=1.5, scale=0.5, size=n_households*len(dates)).clip(0.2, 4.0)
})
simulated_data.to_csv('simulated_gift_smart_meter.csv', index=False)
print('Simulated smart meter data saved as simulated_gift_smart_meter.csv')

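# 1.5 Fetch Weather Data from OpenWeatherMap API (Template)
# Requires an API key from https://openweathermap.org/api
# Note: the historical "timemachine" endpoint is served by One Call API 3.0,
# which may need a separate subscription on your account.
import requests
import pandas as pd

api_key = 'YOUR_API_KEY'  # Replace with your OpenWeatherMap API key
city = 'Gandhinagar'  # Closest city to GIFT City (for reference; the request uses lat/lon)
url = 'https://api.openweathermap.org/data/3.0/onecall/timemachine'

# Example: Fetch weather for a specific date (Unix timestamp in UTC)
# Building the timestamp in UTC avoids depending on the local timezone of the machine.
timestamp = int(pd.Timestamp('2024-01-01', tz='UTC').timestamp())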
params = {
    'lat': 23.2237,  # GIFT City latitude
    'lon': 72.6507,  # GIFT City longitude
    'dt': timestamp,
    'appid': api_key,
    'units': 'metric'
}
# response = requests.get(url, params=params)
# weather_data = response.json()
# print(weather_data)
print('Weather data acquisition: Template - insert your API key and loop over dates as needed.')
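
The weather template above suggests looping over dates. A minimal sketch of that loop is shown here: it only builds the per-day request parameters and makes no network calls. The helper name `build_timemachine_params` is hypothetical (not part of the original notebook), and it assumes one request per day at 00:00 UTC is the desired granularity.

```python
import pandas as pd


def build_timemachine_params(start, end, lat, lon, api_key, units="metric"):
    """Return one parameter dict per day between start and end (inclusive).

    Hypothetical helper: each dict can be passed as `params=` to requests.get
    for the timemachine URL used in the template above.
    """
    days = pd.date_range(start=start, end=end, freq="D", tz="UTC")
    return [
        {
            "lat": lat,
            "lon": lon,
            "dt": int(day.timestamp()),  # Unix timestamp in UTC
            "appid": api_key,
            "units": units,
        }
        for day in days
    ]


daily_params = build_timemachine_params(
    "2024-01-01", "2024-01-07", lat=23.2237, lon=72.6507, api_key="YOUR_API_KEY"
)
print(len(daily_params), daily_params[0]["dt"])
```

Each dict would then be sent with `requests.get(url, params=p)`, ideally with a short pause between calls to stay within the account's rate limit.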


# Project Introduction: Load Forecasting for GIFT City

This notebook documents the foundational data engineering and infrastructure steps for the hybrid load forecasting model, as part of the GIFT City Capstone project.

**Objective:**
- Build a modular, reproducible data pipeline that sources, integrates, and preprocesses multi-modal datasets and engineers features from them, enabling accurate load forecasting in a greenfield smart city context.

**Scope of this Notebook:**
- Research and document data sources (energy, weather, simulated smart meter data).
- Integrate and preprocess all datasets.
- Engineer domain-specific features for downstream ML models.
- Develop a modular ETL pipeline.
- Validate data and features with EDA and visualizations.
- Document all steps for team handoff and reproducibility.


# Load Forecasting Data Infrastructure Notebook

This notebook documents the foundational data engineering steps for the hybrid load forecasting model (GIFT City Capstone). It covers data sourcing, integration, preprocessing, feature engineering, pipeline development, and validation.

## 1. Research & Data Sourcing
- Review literature and identify optimal datasets.
- Fetch/simulate data from CEA API, GIFT City smart meters (Monte Carlo), and OpenWeatherMap API.
- Document all sources and scripts.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
# Add more as needed (e.g., sklearn, datetime)

### 1.1 Literature Review & Data Source Inventory
- Summarize key findings from literature.
- List and describe all data sources.

### 1.2 Data Acquisition Scripts
- CEA API: [Insert code to fetch data]
- GIFT City Smart Meter Simulation: [Insert Monte Carlo code]
- OpenWeatherMap API: [Insert code to fetch weather data]

## 2. Data Integration
- Load and merge all datasets.
- Ensure time alignment and consistency.
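
One possible way to align datasets with different sampling rates is sketched below, assuming hourly load and 3-hourly weather. The frames and the column names `load_kwh` and `temp_c` are illustrative assumptions, and `pd.merge_asof` is just one option for attaching the most recent weather reading to each load timestamp.

```python
import pandas as pd

# Illustrative frames: hourly load and 3-hourly weather (values are made up).
load = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="h"),
    "load_kwh": [1.2, 1.1, 1.0, 1.4, 1.8, 2.0],
})
weather = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=2, freq="3h"),
    "temp_c": [14.0, 12.5],
})

# Attach the latest weather reading at or before each load timestamp,
# but never one older than 3 hours. Both frames must be sorted by the key.
merged = pd.merge_asof(
    load.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta("3h"),
)

# Basic consistency checks: unique timestamps and a regular hourly step.
is_unique = merged["timestamp"].is_unique
is_hourly = merged["timestamp"].diff().dropna().eq(pd.Timedelta("1h")).all()
print(merged)
print("unique:", is_unique, "hourly:", is_hourly)
```

The `tolerance` argument keeps stale weather readings from being matched silently; rows without a recent enough reading get `NaN` and can be handled in preprocessing.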

## 3. Data Preprocessing
- Outlier handling (IQR method).
- Missing value imputation (forward-fill, etc.).
- Feature normalization/scaling.
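
The three preprocessing steps listed above could look roughly like this. It is a minimal sketch on a toy series: the helper names `clip_iqr` and `min_max_scale` are hypothetical, the 1.5 multiplier is the conventional IQR fence rather than a project decision, and min-max scaling stands in for whichever scaler the team chooses.

```python
import numpy as np
import pandas as pd


def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are left untouched."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def min_max_scale(series):
    """Scale to [0, 1]; a constant series maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


# Toy hourly load with one missing value and one obvious outlier (9.0).
raw = pd.Series([1.0, 1.2, np.nan, 1.1, 9.0, 1.3, 1.2], name="load_kwh")

clipped = clip_iqr(raw)            # 1. outlier handling (IQR method)
filled = clipped.ffill().bfill()   # 2. forward-fill, back-fill for a leading gap
scaled = min_max_scale(filled)     # 3. normalization to [0, 1]

print(pd.DataFrame({"raw": raw, "clipped": clipped, "filled": filled, "scaled": scaled}))
```

Clipping before imputation follows the order of the list above and keeps the outlier from being propagated by the forward-fill.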

## 4. Feature Engineering
- Temporal lags, interaction terms, derived metrics.
- Feature selection (correlation, PCA).
- Visualizations (correlation heatmaps, feature importance).
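
As an illustration of the feature types listed above, the sketch below adds temporal lags, calendar features, and one interaction term to a synthetic hourly frame, then ranks features by absolute correlation with the target. The synthetic data, the column names, and the `add_features` helper are assumptions for demonstration only; PCA and feature-importance plots are not covered here.

```python
import numpy as np
import pandas as pd

# Synthetic week of hourly data with a daily cycle (illustrative values only).
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=168, freq="h")
hours = index.hour.to_numpy()
df = pd.DataFrame(
    {
        "load_kwh": 1.5 + 0.5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.05, len(index)),
        "temp_c": 20 + 5 * np.sin(2 * np.pi * (hours - 3) / 24) + rng.normal(0, 0.5, len(index)),
    },
    index=index,
)


def add_features(frame, lags=(1, 24)):
    """Add lagged load, calendar features, and a temperature-hour interaction."""
    out = frame.copy()
    for lag in lags:
        out[f"load_lag_{lag}h"] = out["load_kwh"].shift(lag)
    out["hour"] = out.index.hour
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    out["temp_x_hour"] = out["temp_c"] * out["hour"]
    # Rows without a full lag history are dropped.
    return out.dropna()


features = add_features(df)

# Rank candidate features by absolute Pearson correlation with the target.
corr_with_load = (
    features.corr()["load_kwh"].drop("load_kwh").sort_values(key=abs, ascending=False)
)
print(corr_with_load)
```

The same `features.corr()` matrix can be passed to `seaborn.heatmap` to produce the correlation heatmap mentioned in the list.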

## 5. ETL Pipeline Development
- Modularize steps into functions/classes.
- Build and test ETL pipeline.
- Add unit tests.
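
A modular pipeline with a unit test might be structured along these lines. This is a sketch under stated assumptions: the function names (`extract`, `transform`, `load_to_sink`, `run_pipeline`) are hypothetical, the extract step returns a small in-memory frame instead of reading files or APIs, and a plain list stands in for the real storage target.

```python
import pandas as pd


def extract():
    """Stand-in extract step returning raw rows with a gap and a duplicate."""
    return pd.DataFrame({
        "timestamp": ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 01:00"],
        "load_kwh": [1.0, None, 1.4],
    })


def transform(raw):
    """Parse timestamps, drop missing loads, and remove duplicate timestamps."""
    out = raw.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out = out.dropna(subset=["load_kwh"]).drop_duplicates(subset="timestamp")
    return out.sort_values("timestamp").reset_index(drop=True)


def load_to_sink(frame, sink):
    """Append the cleaned frame to the sink and return the number of rows written."""
    sink.append(frame)
    return len(frame)


def run_pipeline(sink):
    """Chain the three steps so each can also be tested in isolation."""
    return load_to_sink(transform(extract()), sink)


def test_transform_removes_missing_and_duplicates():
    cleaned = transform(extract())
    assert cleaned["load_kwh"].notna().all()
    assert cleaned["timestamp"].is_unique


sink = []
rows_written = run_pipeline(sink)
test_transform_removes_missing_and_duplicates()
print("rows written:", rows_written)
```

Keeping each step a pure function of its input makes it straightforward to move the test into a `pytest` module later.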

## 6. Validation & Documentation
- EDA and data quality checks.
- Save visualizations and summary reports.
- Document all steps for team handoff.
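
The data quality checks could start from a small summary function such as the one sketched here. The `data_quality_report` name, its column defaults, and the specific checks are assumptions chosen for illustration, not a fixed validation standard for the project.

```python
import numpy as np
import pandas as pd


def data_quality_report(frame, time_col="timestamp", value_col="load_kwh", expected_step="1h"):
    """Summarize basic quality indicators for a time-indexed load table."""
    ts = frame[time_col].sort_values()
    step = pd.Timedelta(expected_step)
    return {
        "rows": len(frame),
        "missing_values": int(frame[value_col].isna().sum()),
        "duplicate_timestamps": int(ts.duplicated().sum()),
        # A gap is any jump between consecutive timestamps larger than the expected step.
        "gaps": int((ts.diff().dropna() > step).sum()),
        "negative_values": int((frame[value_col] < 0).sum()),
    }


# Toy frame with one missing value, one duplicate, one gap, and one negative reading.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00", "2024-01-01 03:00",
    ]),
    "load_kwh": [1.0, np.nan, -0.2, 1.1],
})

report = data_quality_report(sample)
print(report)
```

The returned dict can be written to the summary report alongside the EDA plots.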

## 7. Deliverables
- Unified, preprocessed dataset (CSV/Parquet).
- Feature engineering report (notebook with code and plots).
- Automated ETL pipeline scripts with tests.
- Interim validation metrics and EDA outputs.