![Solar Image ](R.jpg)
# Capstone Project 3: Forecasting Solar Power using ML Techniques
## Subtitle: _*"Exploring ARIMA, Prophet, XGBoost, and LSTM Methods for Solar Predictions"*_

**Author:** Audrey Malloy 

**Date:** April 24th, 2025

## Executive Summary

- **Problem Statement:** 
  This project aims to develop robust predictive models for short-term solar power forecasting, leveraging time series techniques such as ARIMA and Prophet. Accurate predictions are critical for optimizing solar energy usage and resource allocation.

- **Objective:** 
  Build and evaluate machine learning models that forecast solar power generation (`AC_POWER`) using historical trends, weather data, and engineered features with high accuracy.

- **Key Findings:**
  1. **LSTM** achieved the best overall performance, with the highest R-squared (**0.983**) and low error metrics.
  2. **XGBoost** reported the lowest MAE and RMSE but a negative R-squared, meaning it explained less variance than a constant mean prediction and generalized poorly.
  3. Feature importance analysis highlighted weather metrics such as solar irradiance and temperature as significant contributors to predictive accuracy.
  4. Both **ARIMA** and **Prophet** models effectively captured recurring patterns in daily and weekly solar power generation cycles.

- **Recommendations:**
  1. **Ensemble Methods**: Combine the strengths of SARIMAX and Prophet to leverage both models' capabilities for improved accuracy.
  2. **Weather Metrics Integration**: Incorporate additional weather features (e.g., cloud cover, humidity) as external regressors for enhanced predictions.
  3. **Data Expansion**: Collect more granular data points, particularly for rare weather conditions and seasonal transitions, to improve model training.

---


## **Introduction**

### **Problem Statement**
As the world transitions toward renewable energy sources, solar power has become a vital component of sustainable energy systems. However, its inherent variability, driven by weather conditions and seasonality, poses challenges for grid management and energy planning. This project seeks to address these challenges by developing robust, short-term solar power forecasting models. Accurate predictions of solar energy output can ensure efficient resource allocation, stabilize the energy grid, and maximize the economic benefits of solar power.

---

### **Why It Matters**
1. **Renewable Energy Optimization**:
   - Reliable solar power forecasting supports better integration of renewable energy into the grid.
   - Helps reduce dependency on non-renewable sources during peak solar power production.

2. **Operational Efficiency**:
   - Accurate predictions aid in planning storage needs and balancing supply and demand in real time.
   - Minimizes energy wastage and ensures a steady flow of power to consumers.

3. **Economic and Environmental Impact**:
   - Informs decision-making for energy providers, reducing costs associated with unplanned outages or overproduction.
   - Encourages widespread adoption of solar energy by showcasing its reliability and stability.

4. **Advancement in Predictive Modeling**:
   - By comparing ARIMA, Prophet, XGBoost, and LSTM, this project contributes to the growing body of knowledge on time series forecasting methods for renewable energy.

---

## **Methodology**

### **Data Collection**
- **Source**:
  - Historical solar power generation data and weather sensor data from two power plants in India, obtained from the Kaggle dataset [Solar Power Generation Data](https://www.kaggle.com/datasets/anikannal/solar-power-generation-data).
  - The power generation datasets, which include power output and energy yield, are gathered at the inverter level.
  - The sensor data, which includes weather measurements, is gathered at the plant level from a single array of sensors optimally placed at the plant.

- **Time Period**:
  - Data spans a 34-day period, with measurements at 15-minute intervals.

- **Data Format**

Two primary datasets were utilized for this project:

**1. Generation Power Plant Data**
- **Features**:
  - `DATE_TIME`: Timestamp of data recording.
  - `PLANT_ID`: Unique identifier for the solar power plant.
  - `SOURCE_KEY`: Unique identifier for each source or inverter.
  - `DC_POWER`: Direct Current power output from the solar panels (in Watts).
  - `AC_POWER`: Alternating Current power output after inversion (in Watts).
  - `DAILY_YIELD`: Energy produced in a single day (in kWh).
  - `TOTAL_YIELD`: Cumulative energy generated (in kWh).

- **Data Type**:
  - Numeric features include `DC_POWER`, `AC_POWER`, `DAILY_YIELD`, and `TOTAL_YIELD`.
  - Categorical/object features: `DATE_TIME` and `SOURCE_KEY`.

---

**2. Solar Sensor Data**
- **Features**:
  - `DATE_TIME`: Timestamp of data recording.
  - `PLANT_ID`: Solar plant ID.
  - `SOURCE_KEY`: Identifier for the plant-level sensor panel.
  - `AMBIENT_TEMPERATURE`: Surrounding environmental temperature (in °C).
  - `MODULE_TEMPERATURE`: Temperature measured at the solar panel module (in °C).
  - `IRRADIATION`: Solar radiation intensity during the given time interval (in kW/m²).

- **Data Type**:
  - Categorical/object features: `DATE_TIME` and `SOURCE_KEY`.
  - Integer feature: `PLANT_ID`.
  - Float features: `AMBIENT_TEMPERATURE`, `MODULE_TEMPERATURE`, and `IRRADIATION`.
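
Because generation data is recorded per inverter and sensor data per plant, the two tables have to be joined before modeling. The following is a minimal sketch of one way to do that with pandas, assuming a join on `DATE_TIME` and `PLANT_ID`; the tiny in-memory frames, their values, and the key names `inv_a`, `inv_b`, and `sensor_1` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Synthetic stand-ins for the two datasets (illustrative values only).
generation = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 12:00",
        "2020-05-15 12:15", "2020-05-15 12:15",
    ]),
    "PLANT_ID": [1, 1, 1, 1],
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_b"],  # hypothetical inverter IDs
    "DC_POWER": [100.0, 110.0, 120.0, 115.0],
    "AC_POWER": [97.0, 106.0, 116.0, 111.0],
})
sensor = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:15"]),
    "PLANT_ID": [1, 1],
    "SOURCE_KEY": ["sensor_1", "sensor_1"],  # hypothetical sensor ID
    "AMBIENT_TEMPERATURE": [30.1, 30.4],
    "MODULE_TEMPERATURE": [48.2, 49.0],
    "IRRADIATION": [0.80, 0.85],
})

# Each plant-level sensor reading is attached to every inverter row that
# shares its timestamp and plant. The sensor's SOURCE_KEY is dropped so it
# does not collide with the inverter's SOURCE_KEY.
merged = generation.merge(
    sensor.drop(columns="SOURCE_KEY"),
    on=["DATE_TIME", "PLANT_ID"],
    how="left",
    validate="many_to_one",  # fail loudly if sensor rows are duplicated
)
print(merged)
```

A left join keeps every inverter reading even when a sensor reading is missing for that timestamp, which leaves the gap visible as `NaN` for the later missing-value step.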

---
**Preprocessing**:
  - Verified data consistency and cleaned for:
    - Missing values: Imputed or removed incomplete entries.
    - Outliers: Detected and mitigated anomalies.
    - Normalization: Scaled features for better model performance.
  - Extracted features such as lagged values, rolling averages, and time-based indicators (e.g., hour of day `HOUR`, day of week `DAY_OF_WEEK`).
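
The feature extraction step can be sketched as follows. This is an illustrative example on a synthetic 15-minute series, not the project's actual pipeline; the column names `AC_POWER_LAG_1` and `AC_POWER_ROLL_1H` and the one-hour window are assumptions, while `HOUR` and `DAY_OF_WEEK` follow the names used above.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute series standing in for one inverter's AC_POWER.
idx = pd.date_range("2020-05-15", periods=8, freq="15min")
df = pd.DataFrame({"AC_POWER": np.arange(8, dtype=float)}, index=idx)

# Time-based indicators.
df["HOUR"] = df.index.hour
df["DAY_OF_WEEK"] = df.index.dayofweek  # Monday = 0

# Lagged value: the reading from the previous 15-minute interval.
df["AC_POWER_LAG_1"] = df["AC_POWER"].shift(1)

# Rolling mean of the previous four readings (one hour). Shifting first
# keeps the current value out of its own feature, avoiding target leakage.
df["AC_POWER_ROLL_1H"] = df["AC_POWER"].shift(1).rolling(window=4).mean()

# Rows without a full history contain NaN and are dropped.
features = df.dropna()
print(features)
```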


---

## **Exploratory Data Analysis (EDA) Report**

Exploratory Data Analysis (EDA) was conducted to gain insights into the solar power dataset, identify trends, detect anomalies, and evaluate relationships between variables. This process guided feature selection and informed modeling decisions for solar power predictions.

---
### **1. Understand Data Distribution**
![Distribution ](Generation_distribution.png)

### **2. Identify Relationships and Trends**
#### **2.1 Time Series Analysis**
- **DC Power Generation (Target)**:
  - Fluctuations in the series reflect day-to-day variation in solar power generation.
  - Peaks may correspond to sunny days with high solar irradiance.
    
![Time Series DC Power Graph](DAILY_DC.png)

- **Plant Efficiency**:
  - Plant efficiency alternates between lows and highs over time, a trend worth monitoring. Likely drivers are weather and seasonal variations in sunlight availability.
![Plant Efficiency over time](Plant_efficiency_over_time.png)

- **Efficiency and Power Generation Hourly**:
    - Efficiency is higher during early morning (6-8 AM) and late afternoon (3-5 PM).
    - There is a noticeable dip in efficiency during midday (10 AM - 2 PM), despite higher solar power generation.
![Inverter Efficiency over time](Inverter_Efficiency.png)
- This pattern may suggest operational inefficiencies or thermal challenges affecting inverter performance during peak sunlight hours. Investigating inverter cooling systems could provide improvements.
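
The hourly efficiency comparison can be reproduced along these lines. The report does not state how efficiency was computed, so this sketch assumes `Inverter_Efficiency` is the ratio of `AC_POWER` to `DC_POWER`, left undefined when no DC power is produced; the sample values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic readings: one night-time row, two morning rows, two midday rows.
df = pd.DataFrame({
    "HOUR": [0, 7, 7, 12, 12],
    "DC_POWER": [0.0, 200.0, 100.0, 1000.0, 800.0],
    "AC_POWER": [0.0, 196.0, 98.0, 960.0, 768.0],
})

# Assumed definition: AC output divided by DC input.
# Rows with zero DC power are set to NaN rather than treated as 0% efficient.
ratio = df["AC_POWER"] / df["DC_POWER"]
df["Inverter_Efficiency"] = ratio.where(df["DC_POWER"] > 0)

# Mean efficiency per hour of day; NaN rows are skipped by mean().
hourly = df.groupby("HOUR")["Inverter_Efficiency"].mean()
print(hourly)
```

Masking the zero-power rows matters because night-time readings would otherwise drag the hourly averages toward zero or produce division artifacts.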

<table>
  <tr>
    <td><img src="AC_HOUR.png" alt="Average AC Power by Hour" width="400" height="500"></td>
    <td><img src="AVG_DC_HOUR.png" alt="Average DC Power by Hour" width="400" height="500"></td>
  </tr>
</table>

- DC power generation closely follows the same pattern as AC power, peaking at midday and dipping during non-solar hours.

---
#### **2.2 Plant Comparison**
<table>
  <tr>
    <td><img src="Inverter_plantID.png" alt="Inverter efficiency by Plant" width="500"></td>
    <td><img src="normalizedyield_PLANT.png" alt="Normalized Yield" width="500"></td>
  </tr>
</table>

- The stark difference in inverter efficiency between the two plants indicates operational challenges in Plant 1, possibly linked to inverter maintenance, environmental factors, or outdated equipment. Investigating the cause of inefficiencies in Plant 1 could lead to performance optimization.
- While Plant 1 produces a higher normalized yield overall, the variability suggests inconsistency. Plant 2 may deliver more consistent but lower energy outputs.

![DC POWER and Irradiation by plant](DC_Irradiation.png)


---

#### **2.3 Correlation Heatmap Insights**
The correlation heatmap visualizes relationships between variables in the solar plant dataset using colors ranging from blue (negative correlation) to red (positive correlation).

![Correlation](Correlation_analysis.png)

**Key insights include:**

**1. Strong Positive Correlations**
- `MODULE_TEMPERATURE` and `DC_POWER`:
  Higher module temperatures are associated with increased energy generation.
- `IRRADIATION` and `DC_POWER`:
  Emphasizes the crucial role of sunlight intensity in power generation.
- `DC_POWER` and `AC_POWER`:
  An almost perfect correlation reflects their direct connection in the power conversion process.

**2. Negative Correlations**
- `HOUR`:
  Slight negative correlations with power metrics (`DC_POWER`, `AC_POWER`) indicate energy generation decreases during non-solar hours.

**3. Low Correlations**
- `PLANT_ID` and `SOURCE_KEY`:
  Negligible correlations with continuous features, serving primarily as identifiers rather than influencing factors.

**4. Efficiency Metrics**
- Correlations involving `Inverter_Efficiency` and `Plant_Efficiency`:
  May reveal valuable operational optimizations or inefficiencies.

**Next Steps**
- Remove redundant features like `PLANT_ID` and `SOURCE_KEY` during **preprocessing**.
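
A correlation matrix like the one behind the heatmap can be computed as sketched below. The data here is synthetic and built so that `DC_POWER` tracks `IRRADIATION` and `AC_POWER` is a fixed fraction of `DC_POWER`; the coefficients 1000 and 0.97 are arbitrary assumptions, not properties of the real plants.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
irradiation = rng.uniform(0.0, 1.0, n)

df = pd.DataFrame({
    "PLANT_ID": [1] * n,          # identifier, not a predictor
    "SOURCE_KEY": ["inv_a"] * n,  # hypothetical inverter ID
    "IRRADIATION": irradiation,
    "DC_POWER": 1000.0 * irradiation + rng.normal(0.0, 20.0, n),
})
df["AC_POWER"] = 0.97 * df["DC_POWER"]

# Drop identifier columns before computing Pearson correlations.
numeric = df.drop(columns=["PLANT_ID", "SOURCE_KEY"])
corr = numeric.corr(method="pearson")
print(corr.round(3))
```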

---



## **Model Selection**

### **Models**
#### **Selected Models**
1. **ARIMA**:
   - A statistical approach tailored for time series forecasting.
   - Captures dependencies through autoregressive terms, differencing, and moving averages.

2. **Prophet**:
   - A robust time series model designed to handle non-stationary data with clear trend and seasonal components.
   - Excels at decomposing time series into interpretable components (trend, seasonality, holidays).

3. **XGBoost**:
   - An advanced gradient-boosting machine learning model that excels in handling structured data.
   - Particularly effective for regression tasks, leveraging feature engineering to model non-linear relationships.

4. **LSTM**:
   - A deep learning model specialized for sequential data.
   - Captures long-term dependencies and nonlinear temporal patterns using memory cells.

---

### **Justification**
- **ARIMA**:
  - Ideal for modeling short-term dependencies and stationary time series data.
- **Prophet**:
  - Selected for its ability to handle trends and seasonality while allowing easy customization.
- **XGBoost**:
  - Chosen for its flexibility and strong performance in regression problems, particularly with engineered features.
- **LSTM**:
  - Preferred for its ability to learn complex patterns and dependencies in sequential data, making it highly suitable for solar power forecasting.

---

### **Optimization Methods**
- **For ARIMA**:
  1. Conducted Augmented Dickey-Fuller (ADF) tests to check for stationarity.
  2. Tuned hyperparameters `(p, d, q)` and seasonal components `(P, D, Q, m)` using `auto_arima`.

![ARIMA](ARIMA_forcast_graph.png)
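
To make the roles of `d`, `D`, and `m` concrete, the sketch below applies seasonal and first differencing to a synthetic series with NumPy. With 15-minute data, one day contains 96 samples, so a daily seasonal period corresponds to `m = 96`; whether the project used that value is not stated, so treat it as an assumption. The synthetic series (a repeating daytime bump plus a linear drift) is also illustrative only.

```python
import numpy as np

# 15-minute sampling gives 96 observations per day.
SAMPLES_PER_DAY = 24 * 60 // 15

# Three synthetic days: a daytime bump that repeats daily, plus slow drift.
t = np.arange(3 * SAMPLES_PER_DAY)
phase = 2 * np.pi * (t % SAMPLES_PER_DAY) / SAMPLES_PER_DAY
daily_shape = np.clip(-np.cos(phase), 0.0, None)  # zero at night, peak at midday
series = daily_shape + 0.001 * t

# Seasonal differencing (D = 1, m = 96): subtract the value one day earlier.
# The repeating daily shape cancels, leaving only the drift per day.
seasonal_diff = series[SAMPLES_PER_DAY:] - series[:-SAMPLES_PER_DAY]

# First differencing (d = 1): subtract the previous sample.
# The remaining constant drift cancels as well.
fully_diff = np.diff(seasonal_diff)

print(seasonal_diff[:3], fully_diff[:3])
```

A stationarity test such as ADF would then be applied to the differenced series rather than the raw one.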

- **For Prophet**:
  1. Incorporated day-of-week and hour-of-day effects to model power fluctuations accurately.

![Prophet](Prophetforecase.png)


- **For XGBoost**:
  1. Engineered temporal indicators.
  2. Used grid search to tune hyperparameters such as `learning_rate`, `max_depth`, and `n_estimators`.

- **For LSTM**:
  1. Reshaped the dataset into sequences for time series input.
  2. Implemented regularization methods (dropout, early stopping) to prevent overfitting.
  3. Trained the model using appropriate batch sizes and epochs for optimal performance.

![LSTM Actual v Predict](LSTM_Actualvpredicted.png)
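
The sequence reshaping step for the LSTM can be sketched as a sliding window over the series. This is a minimal illustration, assuming a one-step-ahead target and a single input feature; the helper name `make_windows` and the window length are hypothetical, and the report does not specify the framework or window size actually used. The output shape `(samples, timesteps, features)` is the layout recurrent layers in Keras expect.

```python
import numpy as np

def make_windows(values, window):
    """Turn a 1-D series into (X, y) pairs for one-step-ahead forecasting.

    X has shape (samples, window, 1); y[i] is the value that immediately
    follows the i-th window.
    """
    values = np.asarray(values, dtype=float)
    n_samples = len(values) - window
    X = np.stack([values[i:i + window] for i in range(n_samples)])
    y = values[window:]
    return X[..., np.newaxis], y

# Example: ten readings, four-step history per sample.
X, y = make_windows(np.arange(10), window=4)
print(X.shape, y.shape)
```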

---

## **Feature Importance Analysis**

- **Objective**:
  Feature analysis was performed to identify the most significant predictors of solar power (`AC_POWER`) and improve forecasting accuracy.

  
![Feature Analysis](feature_importance.png)


---


## **Recommendations**

1. **Enhance Model Accuracy**:
   - Implement ensemble models to combine the strengths of ARIMA and Prophet, leveraging their complementary capabilities in capturing short-term dependencies and seasonal trends.

2. **Incorporate External Features**:
   - Include additional weather-related variables such as humidity and cloud cover to improve model precision.
   - Utilize geographical and elevation data to further refine predictions for specific regions.

3. **Optimize Deployment**:
   - Optimize the computational efficiency of forecasting models to enable real-time predictions for solar power systems.
   - Develop user-friendly dashboards for visualizing forecasts and tracking model performance.

---

### **Further Research Ideas**

1. **Integration of Real-Time Data**:
   - Investigate the feasibility of integrating real-time sensor data (e.g., from weather stations or solar panels) into forecasting models for on-the-fly updates.

2. **Renewable Energy Mix Forecasting**:
   - Expand the scope to include wind and hydropower predictions, creating a holistic renewable energy forecasting system.

3. **Spatial Forecasting**:
   - Develop models that forecast solar power output across multiple locations, leveraging spatial correlations.

4. **Climate Change Impact Analysis**:
   - Study the long-term impact of climate change on solar power generation trends and adjust forecasting models accordingly.

5. **Explainability in Forecasting**:
   - Explore methods to improve model transparency and explainability, making predictions more interpretable for stakeholders.

---


## **Conclusion**

This project successfully demonstrated the effectiveness of multiple machine learning techniques, including ARIMA, Prophet, XGBoost, and LSTM, for forecasting short-term solar power generation (`AC_POWER`).

---

### **Key Outcomes**
1. **Model Performance**:

   ![Model metrics](metric.png)
   
| **Metric**        | **SARIMAX**   | **Prophet**   | **XGBoost**      | **LSTM**        |
|--------------------|---------------|---------------|------------------|-----------------|
| **R2 Score**       | 0.813709      | 0.879648      | -0.542846        | 0.982568        |
| **MAE**            | 2.255862      | 2.431660      | 0.004991         | 0.022973        |
| **RMSE**           | 4.022452      | 3.059362      | 0.009145         | 0.030747        |

   - **ARIMA (SARIMAX in the table)**:
     - Reliable for capturing short-term dependencies and minimizing absolute forecasting errors.
     - Achieved a MAE of 2.255862 and an RMSE of 4.022452, making it suitable for modeling stationary time series with clear autocorrelations.

   - **Prophet**:
     - Excelled in modeling long-term trends and seasonality, with an R2 of 0.879648 and RMSE of 3.059362.
     - Ideal for datasets with non-stationary behaviors and interpretable seasonal patterns.
    
   - **XGBoost**:
     - Reported the lowest MAE (0.004991) and RMSE (0.009145) of the four models.
     - Struggled with generalization, as evidenced by the negative R2 score (-0.542846), which means its predictions explained less variance than a constant mean prediction. Likely causes are insufficient contextual regressors or hyperparameter tuning.

   - **LSTM**:
     - Achieved the best overall performance, with an R2 score of 0.982568 and low error metrics (MAE: 0.022973, RMSE: 0.030747).
     - Effectively captured complex nonlinear and temporal dependencies in the data, making it ideal for high-accuracy, short-term forecasting.


![Model Comparison](comparison_metrics.png)
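
The table above pairs a negative R2 for XGBoost with the smallest MAE and RMSE. The sketch below shows how the three metrics are commonly computed and how that combination can arise when the evaluated target has very little spread, for example on a normalized scale. The function name `regression_metrics` and the sample arrays are illustrative assumptions; the report does not state on what scale each model was scored.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, and RMSE for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # spread around the mean
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(err)),
        "rmse": np.sqrt(np.mean(err ** 2)),
    }

# A target that barely varies: every error is only 0.01, yet the errors
# are larger than the target's own spread, so R2 falls below zero.
y_true = np.array([0.50, 0.51, 0.49, 0.50])
y_pred = np.array([0.51, 0.50, 0.50, 0.49])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because MAE and RMSE carry the units of the target, they are only comparable across models scored on the same scale, whereas R2 is unitless.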


2. **Feature Analysis**:
   - Weather metrics (e.g., solar irradiance and module temperature) emerged as the most significant predictors.
   - Incorporating lagged values and rolling averages significantly improved the performance of machine learning models like XGBoost and LSTM.
   - Time-based features (e.g., hour of day, day of week) enhanced the ability of ARIMA and Prophet to capture cyclic and seasonal patterns.


3. **Insights on Model Strengths**:
   - **ARIMA**: Reliable for short-term stationarity and simplicity, but less effective at handling complex trends.
   - **Prophet**: Robust for long-term trend analysis and seasonality detection, with flexibility for custom external regressors.
   - **XGBoost**: Strong at reducing large prediction errors, but less adept at modeling sequential dependencies compared to LSTM.
   - **LSTM**: Best suited for non-linear, time-dependent patterns, showcasing its adaptability in solar power forecasting.

---

### **Recommendations for Improvement**
1. **Feature Enrichment**:
   - Incorporate additional weather features, such as cloud cover, wind speed, and humidity, for more accurate predictions.
   - Use spatial data (e.g., geographic location) to refine context-based forecasting.
2. **Model Enhancements**:
   - Investigate ensemble modeling to combine strengths of ARIMA and Prophet (trend/seasonality) with LSTM’s nonlinear pattern recognition.
   - Optimize XGBoost performance by fine-tuning hyperparameters and incorporating additional engineered features.
3. **Real-Time Forecasting**:
   - Explore the integration of real-time data pipelines for dynamic solar power predictions.
   - Leverage automated model updates to ensure adaptability under changing weather conditions.

---

### **Final Thoughts**
Accurate solar power forecasting is critical for optimizing energy management and advancing renewable energy systems. By leveraging a diverse range of models, this project highlights the strengths and trade-offs of statistical, machine learning, and deep learning approaches in addressing the inherent variability of solar power generation. Future research should focus on hybrid modeling and real-time deployment to scale predictive capabilities and accelerate the transition to renewable energy systems.