In this notebook, we build several regression models to predict the electrical energy output of a combined cycle power plant (CCPP) from ambient environmental conditions. The dataset contains the following columns:

- Ambient Temperature (AT) in °C,
- Exhaust Vacuum (V) in cm Hg,
- Ambient Pressure (AP) in millibar,
- Relative Humidity (RH) in percent,
- Net hourly electrical energy output (PE) in MW.

PE is our target variable, and the other four variables are the features used to train the models. Since the target is a continuous numerical variable, we will build regression models and compare them using two metrics: mean squared error (MSE) and R squared.
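As a quick illustration of the two metrics, the sketch below computes MSE and R squared by hand and checks them against scikit-learn. The arrays `y_true` and `y_pred` are made-up toy values, not taken from the CCPP dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration (not from the CCPP dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))  # both 0.375
print(r2_manual, r2_score(y_true, y_pred))             # both 0.925
```

MSE is expressed in squared units of the target (MW squared here), so a lower value is better, while R squared is unitless and a value closer to 1 is better.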

First, let's import the necessary libraries.

In [1]:
#Library for file path
import os

#Pandas for data
import pandas as pd

#Scikit-learn for machine learning models
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

Before building the regression models, we need to import the dataset.

In [2]:
#Get the directory path for the dataset file
cwd = os.getcwd()
data_path = os.path.join(cwd, "CCPP_data.csv")
data_path

'/content/CCPP_data.csv'

In [3]:
#Read the csv file as a dataframe
df = pd.read_csv(data_path)

#View top 5 rows of dataset
df.head()

Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


Our features and target variable are defined below.

In [4]:
#Features
X = df[['AT', 'V', 'AP', 'RH']]

#Target
Y = df[['PE']]

We will hold out 20% of the data as the test set.

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1729)

Now, let's build a linear regression model using the training set. Then, we can evaluate the performance of the linear regression model on the test set.

In [6]:
#Linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

In [7]:
#MSE on test set
Y_test_hat = lin_reg.predict(X_test)
lin_reg_MSE = mean_squared_error(Y_test, Y_test_hat)

#R squared on test set
lin_reg_R2 = lin_reg.score(X_test, Y_test)

print(f'The MSE of linear regression model on test set is {lin_reg_MSE}.')
print(f'The R squared of linear regression model on test set is {lin_reg_R2}.')

The MSE of linear regression model on test set is 21.28905550582786.
The R squared of linear regression model on test set is 0.927958603654873.


Next, let's build a quadratic regression model and evaluate it on the test set in the same way.

In [8]:
#Transform features quadratically
pr = PolynomialFeatures(degree=2)
X_train_quad = pr.fit_transform(X_train)

#Quadratic regression model
quad_reg = LinearRegression()
quad_reg.fit(X_train_quad, Y_train)

In [9]:
#MSE on test set
#Reuse the transformer fitted on the training set instead of refitting it on test data
X_test_quad = pr.transform(X_test)
Y_test_hat = quad_reg.predict(X_test_quad)
quad_reg_MSE = mean_squared_error(Y_test, Y_test_hat)

#R squared on test set
quad_reg_R2 = quad_reg.score(X_test_quad, Y_test)

print(f'The MSE of quadratic regression model on test set is {quad_reg_MSE}.')
print(f'The R squared of quadratic regression model on test set is {quad_reg_R2}.')

The MSE of quadratic regression model on test set is 18.93231445711361.
The R squared of quadratic regression model on test set is 0.9359337304014201.


The quadratic regression model performs better on the test set than the linear regression model, with a lower MSE and a higher R squared. To reduce the risk of overfitting, we will also fit ridge and lasso versions of the quadratic model, which penalize large coefficients. We will select the regularization strength alpha for each of them using a grid search with 5-fold cross-validation.
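The difference between the two penalties can be seen in a small sketch on synthetic data (not the CCPP dataset). The data, the names `ridge_demo` and `lasso_demo`, and the value `alpha=0.5` are all assumptions made for this demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data for illustration only (not the CCPP dataset):
# five features, of which only the first two influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# alpha=0.5 is an arbitrary value picked for the demonstration
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("lasso coefficients set exactly to zero:", int(np.sum(lasso_demo.coef_ == 0)))
```

In this sketch, ridge should keep all five coefficients non-zero while shrinking them slightly, whereas lasso should set the coefficients of the three uninformative features exactly to zero. Lasso can therefore act as a feature selector, while ridge only dampens coefficients.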

In [10]:
#Grid search for ridge regression
RR = Ridge()
Grid_RR = GridSearchCV(RR, {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}, cv=5)

#Suitable hyperparameter for quadratic ridge regression
Grid_RR.fit(X_train_quad, Y_train)
quad_RR = Grid_RR.best_estimator_
quad_RR

In [11]:
#Grid search for lasso regression
LR = Lasso(tol=0.05)
Grid_LR = GridSearchCV(LR, {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}, cv=5)

#Suitable hyperparameter for quadratic lasso regression
Grid_LR.fit(X_train_quad, Y_train)
quad_LR = Grid_LR.best_estimator_
quad_LR

The grid searches select alpha = 1 for quadratic ridge regression and alpha = 10 for quadratic lasso regression. We can compare the performance of these two models by calculating their MSE and R squared on the test set.
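Beyond `best_estimator_`, a fitted `GridSearchCV` object also exposes the selected parameters and the cross-validated score of every candidate. The sketch below shows this on synthetic data (not the CCPP dataset); `grid_demo` and the three-value alpha grid are illustrative assumptions. Note that, unless a `scoring` argument is passed, the search ranks candidates by the estimator's default score, which is R squared for `Ridge` and `Lasso`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only (not the CCPP dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

grid_demo = GridSearchCV(Ridge(), {"alpha": [0.01, 1, 100]}, cv=5)
grid_demo.fit(X_demo, y_demo)

# Selected hyperparameter and its mean cross-validated R squared
print(grid_demo.best_params_)
print(grid_demo.best_score_)

# Mean cross-validated score of every candidate in the grid
mean_scores = grid_demo.cv_results_["mean_test_score"]
for alpha, score in zip(grid_demo.cv_results_["param_alpha"], mean_scores):
    print(alpha, score)
```

To select by MSE instead, `scoring="neg_mean_squared_error"` can be passed to `GridSearchCV`.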

In [12]:
#Quadratic ridge regression MSE on test set
Y_test_hat = quad_RR.predict(X_test_quad)
quad_RR_MSE = mean_squared_error(Y_test, Y_test_hat)

#Quadratic ridge regression R squared on test set
quad_RR_R2 = quad_RR.score(X_test_quad, Y_test)

print(f'The MSE of quadratic ridge regression model on test set is {quad_RR_MSE}.')
print(f'The R squared of quadratic ridge regression model on test set is {quad_RR_R2}.')

The MSE of quadratic ridge regression model on test set is 18.925896344702355.
The R squared of quadratic ridge regression model on test set is 0.935955449062442.


In [13]:
#Quadratic lasso regression MSE on test set
Y_test_hat = quad_LR.predict(X_test_quad)
quad_LR_MSE = mean_squared_error(Y_test, Y_test_hat)

#Quadratic lasso regression R squared on test set
quad_LR_R2 = quad_LR.score(X_test_quad, Y_test)

print(f'The MSE of quadratic lasso regression model on test set is {quad_LR_MSE}.')
print(f'The R squared of quadratic lasso regression model on test set is {quad_LR_R2}.')

The MSE of quadratic lasso regression model on test set is 20.706752033072615.
The R squared of quadratic lasso regression model on test set is 0.9299290976141951.


Between the two regularized models, quadratic ridge regression performs better on the test set than quadratic lasso regression, with a lower MSE and a higher R squared. Quadratic ridge regression is also marginally better than the unregularized quadratic regression model (MSE of about 18.926 versus 18.932), although the difference is very small. Therefore, of the regression models built here, the quadratic ridge regression model with alpha = 1 predicts the electrical energy output most accurately on the test set.
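For a side-by-side view, the test-set metrics printed above can be collected into a single table. The sketch below copies those printed values by hand. The `results` DataFrame and the extra RMSE column (the square root of the MSE, in MW) are additions made here for readability.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above
results = pd.DataFrame(
    {
        "MSE": [21.28905550582786, 18.93231445711361, 18.925896344702355, 20.706752033072615],
        "R2": [0.927958603654873, 0.9359337304014201, 0.935955449062442, 0.9299290976141951],
    },
    index=["Linear", "Quadratic", "Quadratic ridge", "Quadratic lasso"],
)

# RMSE is in the same unit as the target (MW), which makes it easier to interpret
results["RMSE"] = results["MSE"] ** 0.5

# Best model (lowest MSE) first
results_sorted = results.sort_values("MSE")
print(results_sorted)
```

Sorted this way, the models rank as quadratic ridge, quadratic, quadratic lasso, and linear.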