# Machine Learning Foundations for Product Managers

## Project topic

In this project we will build a model to predict the electrical energy output of a [Combined Cycle Power Plant](https://www.wikiwand.com/en/articles/Combined_cycle_power_plant), which uses a combination of gas turbines, steam turbines, and heat recovery steam generators to generate power. The dataset contains 9568 hourly average readings from sensors at the power plant, which we will use to train and evaluate the model.

The columns in the data are hourly averages of the following variables:
- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26-495.76 MW (the target we are trying to predict)

The dataset may be downloaded as [a csv file](https://storage.googleapis.com/aipi_datasets/CCPP_data.csv).

Data source:

- Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.
- Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen, Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai).

## Selecting the type of ML approach

- The dataset we use in this project provides observations organized as structured data (a table in the CSV file)
- The dataset includes five continuous columns; four columns can be used to train a model, therefore we have four features
- The PE column represents values we want to predict for new data – it will be our target
- The target is a continuous numeric value (energy output in MW), so this is a supervised regression problem
- To evaluate our model we can use Root Mean Squared Error (RMSE), which is sensitive to large errors and quite interpretable since it’s in the same unit as the target variable
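
To illustrate why RMSE is described as sensitive to large errors, here is a minimal sketch that computes it by hand. The `rmse` helper and the sample values are illustrative assumptions, not taken from the dataset. Both prediction sets have the same mean absolute error (2 MW), but the one with a single large miss gets twice the RMSE.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (not rows from the dataset), in MW.
actual = [450.0, 460.0, 470.0, 480.0]
even_misses = [452.0, 458.0, 472.0, 478.0]   # every prediction is off by 2 MW
one_big_miss = [450.0, 460.0, 470.0, 488.0]  # a single prediction is off by 8 MW

print(rmse(actual, even_misses))   # 2.0
print(rmse(actual, one_big_miss))  # 4.0
```

Because the result is in MW, it can be read directly as a typical size of the prediction error.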

## Features and possible ML algorithms

To predict the electrical energy output of a Combined Cycle Power Plant, we can use all four features: 

- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg

We can try the following models:

- Linear Regression and its variants, which could be a good baseline
- Tree-based models like Random Forest Regressor and Gradient Boosting Regressor, which are effective for handling nonlinear relationships

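With only four numeric features and fewer than 10,000 rows of tabular data, neural networks are likely to be overkill here; the simpler models above should be able to capture the relationships in the data.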

## Validation strategy for comparing different models

Given the dataset size, 5-Fold Cross-Validation can offer a robust performance comparison among models without being overly computationally expensive. This will provide a more generalized performance estimate across various subsets of the data, helping ensure that the model selection process isn’t dependent on one particular data split.
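
As a rough illustration of what 5-fold splitting does, the sketch below partitions row indices into five validation folds, each paired with the remaining rows for training. The `k_fold_indices` helper is hypothetical and skips shuffling for brevity; in practice a library implementation such as scikit-learn's `KFold` would be used. The training-set size of 7654 rows is an assumption based on keeping 80% of the 9568 rows.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) per fold, without shuffling."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

# Assumed training-set size: 80% of the 9568 rows.
n_train = 7654
folds = list(k_fold_indices(n_train))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)  # [1531, 1531, 1531, 1531, 1530]
```

Every row is used for validation exactly once and for training in the other four folds, which is why the averaged score depends less on any single split.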

## Comparing models

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

In [2]:
# Load the dataset
data = pd.read_csv('CCPP_data.csv')

In [3]:
# Split the dataset: 80% for training, 20% for testing
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

In [4]:
# Define the features (X) and target (y) variables for the training set
X = train_data.drop(columns=['PE'])  # Assuming 'PE' is the target column for energy output
y = train_data['PE']

# Define models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42)
}

# Dictionary to store the mean RMSE for each model
model_scores = {}

In [5]:
# Set up 5-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform 5-Fold Cross-Validation for each model and store the mean RMSE
for model_name, model in models.items():
    # Cross-validate and calculate RMSE
    cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_root_mean_squared_error')
    mean_rmse = -np.mean(cv_scores)  # Convert to positive RMSE
    model_scores[model_name] = mean_rmse

In [6]:
# Print the RMSE for each model
print("Model RMSE Scores:")
for model_name, score in model_scores.items():
    print(f"{model_name}: {score}")

Model RMSE Scores:
Linear Regression: 4.5721058283634095
Random Forest Regressor: 3.4601575562911058
Gradient Boosting Regressor: 3.9210927380231175


In [7]:
# Find the model with the lowest mean RMSE
best_model_name = min(model_scores, key=model_scores.get)
best_model_rmse = model_scores[best_model_name]

print(f"\nBest Model: {best_model_name}")
print(f"RMSE: {best_model_rmse}")


Best Model: Random Forest Regressor
RMSE: 3.4601575562911058
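
To make the gap between the models easier to read, the cross-validated scores printed above can be expressed relative to the linear regression baseline. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Mean cross-validated RMSE values copied from the printed output (MW).
cv_rmse = {
    "Linear Regression": 4.5721058283634095,
    "Random Forest Regressor": 3.4601575562911058,
    "Gradient Boosting Regressor": 3.9210927380231175,
}

baseline = cv_rmse["Linear Regression"]

# Percentage reduction in RMSE relative to the linear baseline.
reductions = {
    name: (baseline - score) / baseline * 100
    for name, score in cv_rmse.items()
}

for name, score in cv_rmse.items():
    print(f"{name}: {score:.2f} MW ({reductions[name]:.1f}% below baseline)")
```

On these numbers, the Random Forest Regressor reduces the error by roughly 24% and the Gradient Boosting Regressor by roughly 14% compared with linear regression.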


## Evaluating performance of Random Forest Regressor

Given that the Random Forest Regressor has the lowest mean cross-validated RMSE (about 3.46 MW), we’ll select it as the final model. To evaluate its performance on the held-out test set, we’ll train the Random Forest Regressor on the full training set and then calculate the RMSE on the test set.

In [9]:
from sklearn.metrics import mean_squared_error

# Define the features (X_train and X_test) and target (y_train and y_test)
X_train = train_data.drop(columns=['PE'])
y_train = train_data['PE']
X_test = test_data.drop(columns=['PE'])
y_test = test_data['PE']

# Train the final model (Random Forest Regressor) on the full training data
final_model = RandomForestRegressor(random_state=42)
final_model.fit(X_train, y_train)

# Predict on the test set
y_pred = final_model.predict(X_test)

# Calculate RMSE on the test set
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE for Random Forest Regressor: {test_rmse}")

Test RMSE for Random Forest Regressor: 3.2432202566089683


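The test RMSE of about 3.24 MW is close to (and slightly below) the cross-validated RMSE of about 3.46 MW, which suggests that the model generalizes to unseen data without obvious overfitting.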
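
To put the test error in context, it can be compared with the spread of the target variable. The sketch below uses only numbers reported in this document: the PE range, the test RMSE, and the cross-validated RMSE. The variable names are illustrative.

```python
# Values reported earlier in this document.
test_rmse = 3.2432202566089683   # MW, held-out test set
cv_rmse = 3.4601575562911058     # MW, mean of 5-fold cross-validation
pe_min, pe_max = 420.26, 495.76  # MW, range of the target (PE)

target_range = pe_max - pe_min
rmse_share = test_rmse / target_range * 100
gap = test_rmse - cv_rmse

print(f"Target range: {target_range:.2f} MW")
print(f"Test RMSE as a share of the range: {rmse_share:.1f}%")
print(f"Test RMSE minus CV RMSE: {gap:.2f} MW")
```

The typical error is about 4.3% of the 75.50 MW range of observed outputs. Whether that is accurate enough depends on how the predictions will be used.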