# Prediction of the Output Power of a Combined Cycle Power Plant


*Machine Learning Foundations for Product Managers by Duke University and Coursera* 

Course project by Dmitrijs Giždevans

## Project topic:

In this project we will build a model to predict the electrical energy output of a Combined Cycle Power Plant, which uses a combination of gas turbines, steam turbines, and heat recovery steam generators to generate power.  We have a set of 9568 hourly average ambient environmental readings from sensors at the power plant which we will use in our model.

The columns in the data consist of hourly average ambient variables:
- Temperature (AT) in the range 1.81°C to 37.11°C,
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar,
- Relative Humidity (RH) in the range 25.56% to 100.16%,
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg,
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict).

The dataset may be downloaded as a csv file.  Note that Safari users may have to navigate to File -> Save As and select the option "Save as source" to download the file.  Once you have downloaded the data, please review the Project Modeling Options reading and select a method of working with the data to build your model: 1) using Excel, 2) **using Python**, or 3) using Google AutoML.

*Data source:*

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.

Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)

## Project agenda

1. Preparing data for analysis.
2. Data analysis and approach of modeling. 
3. Model building and calculation of metrics for evaluation.
4. Model evaluation and conclusions.

## 1. Preparing data for analysis.

First, we download the data to the local environment and load it into a pandas DataFrame, which is convenient for analysis.

In [12]:
# Import of required packages
import pandas as pd
import numpy as np
from urllib.request import urlretrieve

# URL of the dataset file
url = 'https://storage.googleapis.com/aipi_datasets/CCPP_data.csv'

# Save dataset file locally
urlretrieve(url,'CCPP_data.csv')

# Load dataset to dataframe:
df = pd.read_csv('CCPP_data.csv')

Next, we check how good the data is and how homogeneous it is.

In [13]:
# Preview the data (head(-1) returns every row except the last one)
print(df.head(-1))

         AT      V       AP     RH      PE
0     14.96  41.76  1024.07  73.17  463.26
1     25.18  62.96  1020.04  59.08  444.37
2      5.11  39.40  1012.16  92.14  488.56
3     20.86  57.32  1010.24  76.64  446.48
4     10.82  37.50  1009.23  96.62  473.90
...     ...    ...      ...    ...     ...
9562  14.02  40.10  1015.56  82.44  467.32
9563  16.65  49.69  1014.01  91.00  460.03
9564  13.19  39.18  1023.67  66.78  469.62
9565  31.32  74.33  1012.92  36.48  429.57
9566  24.48  69.45  1013.86  62.39  435.74

[9567 rows x 5 columns]


In [14]:
# Checking data types and completeness of data
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AT      9568 non-null   float64
 1   V       9568 non-null   float64
 2   AP      9568 non-null   float64
 3   RH      9568 non-null   float64
 4   PE      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB
None


In [4]:
# Checking data deviations and compliance of the data with the specified intervals
print(df.describe())

                AT            V           AP           RH           PE
count  9568.000000  9568.000000  9568.000000  9568.000000  9568.000000
mean     19.651231    54.305804  1013.259078    73.308978   454.365009
std       7.452473    12.707893     5.938784    14.600269    17.066995
min       1.810000    25.360000   992.890000    25.560000   420.260000
25%      13.510000    41.740000  1009.100000    63.327500   439.750000
50%      20.345000    52.080000  1012.940000    74.975000   451.550000
75%      25.720000    66.540000  1017.260000    84.830000   468.430000
max      37.110000    81.560000  1033.300000   100.160000   495.760000


As we can see from the checks above, the data is complete (no missing values), entirely numeric, and within the documented ranges, so it is ready for model building.
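The checks above can also be expressed as a small reusable function. The following is a minimal sketch, not part of the original notebook: the helper name `check_quality` and the `EXPECTED_RANGES` dictionary are assumptions introduced for illustration, and the sample frame reuses three rows from the preview plus one deliberately broken row so that the code runs without downloading anything.

```python
import pandas as pd

# Documented (min, max) ranges per column, taken from the project description.
EXPECTED_RANGES = {
    "AT": (1.81, 37.11),
    "V": (25.36, 81.56),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
    "PE": (420.26, 495.76),
}


def check_quality(frame: pd.DataFrame, ranges: dict) -> dict:
    """Hypothetical helper: count missing values, duplicate rows and out-of-range values."""
    report = {
        "missing": int(frame.isna().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
        "out_of_range": {},
    }
    for col, (low, high) in ranges.items():
        # between() is inclusive on both ends; missing values also count as outside.
        outside = ~frame[col].between(low, high)
        report["out_of_range"][col] = int(outside.sum())
    return report


# Three rows copied from the preview above, plus one made-up row with AT = 50.0.
sample = pd.DataFrame({
    "AT": [14.96, 25.18, 5.11, 50.00],
    "V": [41.76, 62.96, 39.40, 45.00],
    "AP": [1024.07, 1020.04, 1012.16, 1010.00],
    "RH": [73.17, 59.08, 92.14, 60.00],
    "PE": [463.26, 444.37, 488.56, 450.00],
})

report = check_quality(sample, EXPECTED_RANGES)
print(report)
```

On the real data, the same call would be `check_quality(df, EXPECTED_RANGES)`; a clean dataset should report zero missing and zero out-of-range values.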

## 2. Data analysis and approach of modeling.

To choose a modeling approach, first of all, we will evaluate how the presented data correlate with each other.

In [5]:
# Check pairwise correlations between the columns of the dataset:
print(df.corr())

          AT         V        AP        RH        PE
AT  1.000000  0.844107 -0.507549 -0.542535 -0.948128
V   0.844107  1.000000 -0.413502 -0.312187 -0.869780
AP -0.507549 -0.413502  1.000000  0.099574  0.518429
RH -0.542535 -0.312187  0.099574  1.000000  0.389794
PE -0.948128 -0.869780  0.518429  0.389794  1.000000


From the obtained correlation data, we can conclude:

- AT and V correlate strongly with each other (r ≈ 0.84);
- AT and V are strongly negatively correlated with PE (r ≈ -0.95 and -0.87);
- AP and RH are more weakly correlated with PE (r ≈ 0.52 and 0.39).
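These conclusions can be read off the matrix programmatically. The sketch below rebuilds the correlation matrix from the printed values and ranks the features by the absolute value of their correlation with PE. The 0.8 threshold for calling two features "strongly correlated" is an assumption chosen for illustration, not a value from the original analysis.

```python
import pandas as pd

cols = ["AT", "V", "AP", "RH", "PE"]

# Correlation matrix copied from the df.corr() output above.
corr = pd.DataFrame(
    [
        [1.000000, 0.844107, -0.507549, -0.542535, -0.948128],
        [0.844107, 1.000000, -0.413502, -0.312187, -0.869780],
        [-0.507549, -0.413502, 1.000000, 0.099574, 0.518429],
        [-0.542535, -0.312187, 0.099574, 1.000000, 0.389794],
        [-0.948128, -0.869780, 0.518429, 0.389794, 1.000000],
    ],
    index=cols,
    columns=cols,
)

# Rank features by the strength (absolute value) of their correlation with the target.
ranking = corr["PE"].drop("PE").abs().sort_values(ascending=False)

# Find feature pairs that are strongly correlated with each other.
THRESHOLD = 0.8  # assumed cut-off, for illustration only
features = [c for c in cols if c != "PE"]
strong_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

print(ranking)
print(strong_pairs)
```

With the real data, `corr = df.corr()` would replace the hard-coded matrix.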

Our task is to build a model that predicts PE (net hourly electrical energy output) from the known variables: Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V).

Since we have a complete, labeled dataset with a continuous target, and the strongest features are almost linearly related to that target, **linear regression** is a reasonable choice for modeling our predictions.

To assess the effectiveness of the method and the choice of features, we prepare three feature sets from the base data. Ambient pressure (AP) is not included in any of them:

**Dataset 1**, where the features are AT and RH. We exclude V, since it correlates strongly with AT and is also strongly correlated with the target.

**Dataset 2**, where the features are V and RH. We exclude AT, since it correlates strongly with V and is also strongly correlated with the target.

**Dataset 3**, where the features are AT, V and RH.

In [15]:
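# Split the data into three feature sets and the target

df1 = df[['AT', 'RH']]        # dataset 1: AT and RH
df2 = df[['V', 'RH']]         # dataset 2: V and RH
df3 = df[['AT', 'V', 'RH']]   # dataset 3: AT, V and RH
dfy = df['PE']                # target: net hourly electrical energy output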

We can now start modeling with linear regression and evaluate the quality and performance of our models with different datasets.

## 3. Model building and calculation of metrics for evaluation.


We will use the scikit-learn machine learning library to build the linear regression models and evaluate them. First, let's import all the necessary tools.

To evaluate the models and dataset, we will use metrics such as:

- Coefficient of determination (R2)
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
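To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The function name `regression_metrics` and the four sample values are made up for illustration; the notebook itself uses the scikit-learn implementations.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: compute R2, MSE, MAE and RMSE from scratch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)        # average squared error
    mae = np.mean(np.abs(errors))     # average absolute error
    rmse = np.sqrt(mse)               # same units as the target
    ss_res = np.sum(errors ** 2)      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mse": mse, "mae": mae, "rmse": rmse}


# Made-up values in MW, for illustration only.
y_true = [460.0, 450.0, 440.0, 470.0]
y_pred = [462.0, 448.0, 441.0, 467.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE is the square root of MSE, the two metrics always rank models in the same order; RMSE is simply easier to read because it is in the units of the target.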




In [7]:
# Import libs for modeling and evaluation 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

So that we can immediately evaluate each model on data it has not seen during training, every dataset is randomly split into a training set and a test set in the proportion of 80% to 20%.
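The effect of the split can be sketched on placeholder data. The arrays below are dummy values with the same number of rows as the dataset (9568); only the sizes of the resulting sets and the reproducibility given by a fixed `random_state` are of interest here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9568  # number of rows reported by df.info()

# Placeholder feature matrix and target, used only to demonstrate the split.
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

train_rows, test_rows = len(X_train), len(X_test)
same_split = np.array_equal(X_test, X_test_again)
no_overlap = set(y_train).isdisjoint(set(y_test))

print(train_rows, test_rows, same_split, no_overlap)
```

Using the same `random_state` for every feature set means that all three models are trained and tested on the same rows, which keeps their metrics comparable.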

Let's start modeling and initial evaluation of the models for each dataset.

In [8]:
# Building model for dataset 1

X = df1
y = dfy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test) 

# Modeling results for dataset 1
d1_train = linreg.score(X_train, y_train)
d1_test = linreg.score(X_test, y_test)

# Evaluate the model that uses AT and RH as features

mse_d1 = mean_squared_error(y_test, y_pred)
mae_d1 = mean_absolute_error(y_test, y_pred)
rmse_d1 = np.sqrt(mse_d1)
r2_d1 = r2_score(y_test, y_pred)

# Store all results of modeling and evaluating for dataset 1 
d1_results = {'train':d1_train, 
              'test' : d1_test,
              'mse' : mse_d1,
              'mae': mae_d1, 
              'rmse' : rmse_d1,
              'r2' : r2_d1}

print('Regression score (training set)', d1_train)
print('Regression score (test set)', d1_test)

Regression score (training set) 0.9197623321851416
Regression score (test set) 0.9256241616264436


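The first dataset gives high regression scores (R² of about 0.92) on both the training and test sets, and the test score is not lower than the training score, so there is no sign of overfitting.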

In [9]:
# Building model for dataset 2

X = df2
y = dfy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test) 

# Modeling results for dataset 2
d2_train = linreg.score(X_train, y_train)
d2_test = linreg.score(X_test, y_test)

# Evaluate the model that uses V and RH as features
mse_d2 = mean_squared_error(y_test, y_pred)
mae_d2 = mean_absolute_error(y_test, y_pred)
rmse_d2 = np.sqrt(mse_d2)
r2_d2 = r2_score(y_test, y_pred)

# Store all results of modeling and evaluating dataset 2
d2_results = {'train':d2_train, 
              'test' : d2_test, 
              'mse' : mse_d2,
              'mae': mae_d2,
              'rmse' : rmse_d2, 
              'r2' : r2_d2}

print('Regression score (training set)', d2_train)
print('Regression score (test set)', d2_test)

Regression score (training set) 0.772340685952033
Regression score (test set) 0.7705492485043965


The regression on the second dataset is clearly less successful: R² drops to about 0.77 on both the training and test sets. It is unlikely that this model will be accurate enough for our predictions.

In [10]:
# Building model for dataset 3

X = df3
y = dfy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test) 

# Modeling results for dataset 3
d3_train = linreg.score(X_train, y_train)
d3_test = linreg.score(X_test, y_test)

# Evaluate the model that uses AT, V and RH as features

mse_d3 = mean_squared_error(y_test, y_pred)
mae_d3 = mean_absolute_error(y_test, y_pred)
rmse_d3 = np.sqrt(mse_d3)
r2_d3 = r2_score(y_test, y_pred)

#Store all results of modeling and evaluating dataset 3.
d3_results = {'train':d3_train, 
              'test' : d3_test, 
              'mse' : mse_d3,
              'mae': mae_d3,
              'rmse' : rmse_d3,
              'r2' : r2_d3}

print('Regression score (training set)', d3_train)
print('Regression score (test set)', d3_test)

Regression score (training set) 0.9273846176347994
Regression score (test set) 0.9322888939219065


Modeling with AT, V and RH together (dataset 3) shows the best results on both the training and test sets.
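The three cells above repeat the same steps, so they could be folded into one function and a loop. The following is a sketch under stated assumptions, not the code that produced the results above: the helper name `evaluate_features` is hypothetical, the feature lists mirror the column selections in the splitting cell, and the `demo` frame is synthetic data with made-up coefficients so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_features(frame, features, target="PE", test_size=0.2, random_state=0):
    """Hypothetical helper: fit a linear regression on `features` and return its metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[features], frame[target], test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "train": model.score(X_train, y_train),  # R2 on the training set
        "test": model.score(X_test, y_test),     # R2 on the test set
        "mse": mse,
        "mae": mean_absolute_error(y_test, y_pred),
        "rmse": np.sqrt(mse),
        "r2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in for the real data (assumed coefficients, illustration only).
rng = np.random.default_rng(0)
n = 500
at = rng.uniform(2, 37, n)
v = 25 + 1.4 * at + rng.normal(0, 6, n)   # V is built to correlate with AT
rh = rng.uniform(25, 100, n)
pe = 500 - 1.7 * at - 0.3 * v - 0.1 * rh + rng.normal(0, 4, n)
demo = pd.DataFrame({"AT": at, "V": v, "RH": rh, "PE": pe})

feature_sets = {
    "d1": ["AT", "RH"],
    "d2": ["V", "RH"],
    "d3": ["AT", "V", "RH"],
}
results = {name: evaluate_features(demo, cols) for name, cols in feature_sets.items()}
summary = pd.DataFrame.from_dict(results, orient="index")
print(summary)
```

With the real data, `demo` would be replaced by `df`. The numbers printed for the synthetic frame are not comparable to the results reported in this project.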

## 4. Model evaluation and conclusions.

Let's prepare a summary table of model performance metrics for each dataset.


In [11]:
# Summary table of the metrics for each dataset
 
df_results = pd.DataFrame(data=[d1_results, d2_results, d3_results], index=['d1', 'd2', 'd3'])

print(df_results)

       train      test        mse       mae      rmse        r2
d1  0.919762  0.925624  21.754028  3.726337  4.664121  0.925624
d2  0.772341  0.770549  67.111554  6.432966  8.192164  0.770549
d3  0.927385  0.932289  19.804675  3.570403  4.450244  0.932289
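The summary table can also be checked and ranked in code. The sketch below rebuilds the table from the printed values, confirms that RMSE is the square root of MSE, picks the best dataset by each metric, and expresses RMSE as a percentage of the mean PE (454.365009 MW, taken from the `df.describe()` output). The column name `rmse_pct_of_mean` is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the printed summary table.
summary = pd.DataFrame(
    {
        "mse": [21.754028, 67.111554, 19.804675],
        "mae": [3.726337, 6.432966, 3.570403],
        "rmse": [4.664121, 8.192164, 4.450244],
        "r2": [0.925624, 0.770549, 0.932289],
    },
    index=["d1", "d2", "d3"],
)
mean_pe = 454.365009  # mean of PE from df.describe()

# Higher is better for R2, lower is better for the error metrics.
best_by_r2 = summary["r2"].idxmax()
best_by_rmse = summary["rmse"].idxmin()
best_by_mae = summary["mae"].idxmin()

# Sanity check: RMSE squared should reproduce MSE up to rounding.
rmse_consistent = np.allclose(summary["rmse"] ** 2, summary["mse"], atol=1e-4)

# RMSE relative to the average output, in percent.
summary["rmse_pct_of_mean"] = 100 * summary["rmse"] / mean_pe

print(best_by_r2, best_by_rmse, best_by_mae, rmse_consistent)
print(summary.round(3))
```

Relating RMSE to the mean output gives a scale-free view of the typical error, which can be easier to communicate than a value in MW.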


Let's evaluate the results for each metric:

- Coefficient of determination (R2).

The coefficient of determination measures the proportion of the variance of the target variable that is explained by the model. It equals one minus the mean squared error normalized by the variance of the target. If it is close to one, the model explains the data well; if it is close to zero, the predictions are no better than always predicting the mean of the target.
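Written out, the definition looks as follows. The symbols are notation introduced here for illustration: $y_i$ are the observed values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observed values in the evaluated set, and $n$ the number of observations.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A model that always predicts $\bar{y}$ makes the numerator equal to the denominator, which gives $R^2 = 0$; a perfect model gives $R^2 = 1$.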

In our case, this metric points to the model on dataset **d3** as the best one (0.932 on the test set). For d2 the value is significantly worse (0.771).

- Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

MSE is used when we want to emphasize large errors and choose the model that makes fewer of them. Squaring the forecast error makes large mistakes weigh more, so a model with a lower MSE tends to make fewer large errors. RMSE is the square root of MSE and is expressed in the same units as the target (MW).

The model on dataset **d3** shows the smallest values of both metrics (MSE 19.80, RMSE 4.45 MW), which is preferable for us. For d2 both values are significantly higher (MSE 67.11, RMSE 8.19 MW).

- Mean Absolute Error (MAE).

Compared with MSE, the Mean Absolute Error (MAE) penalizes large deviations less, because the errors are not squared, and it is therefore less sensitive to outliers. With either of these two functionals, it can be useful to analyze which observations contribute most to the total error: it is possible that a mistake was made in calculating the features or the target value for those observations.
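To see how the absolute and squared error metrics react to a single large error, consider the following sketch. The error values are made up for illustration; the two helper functions are simple NumPy implementations, not the scikit-learn ones used in the notebook.

```python
import numpy as np


def mae(errors):
    """Mean absolute error of an array of residuals."""
    return float(np.mean(np.abs(errors)))


def rmse(errors):
    """Root mean squared error of an array of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))


# Made-up residuals in MW: the second set replaces one small error with a large one.
clean = np.array([1.0, -1.0, 1.0, -1.0])
with_outlier = np.array([1.0, -1.0, 1.0, -10.0])

mae_clean, rmse_clean = mae(clean), rmse(clean)
mae_out, rmse_out = mae(with_outlier), rmse(with_outlier)

print("clean:       ", mae_clean, rmse_clean)
print("with outlier:", mae_out, rmse_out)
```

Both metrics equal 1.0 on the clean residuals, but the single large error raises RMSE noticeably more than MAE, because the squared term dominates the average.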

According to this metric, **d3** is also preferable for us, since it has the lowest of the presented values (3.57 MW).



# **Conclusion:**

- Using linear regression, we can successfully predict PE (net hourly electrical energy output) from Temperature (AT), Exhaust Vacuum (V) and Relative Humidity (RH): the best model reaches R² ≈ 0.93 and RMSE ≈ 4.45 MW on the test set.

- If V (Exhaust Vacuum) is not available, we can still predict PE from AT and RH, with a slight deterioration in prediction quality (RMSE ≈ 4.66 MW instead of 4.45 MW).

- For this task, linear regression is a suitable method, since it trains quickly and gives accurate predictions on this data.
