# Linear regression with scikit-learn

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work at full load. The features are hourly averages of four ambient variables: Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V). They are used to predict the net hourly electrical energy output (PE) of the plant. A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the steam turbine, the other three ambient variables affect the GT performance.

First the required libraries are imported.

In [1]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

Next the data is loaded.

In [2]:
df = pd.read_csv(r'C:\PowePlantProject\powerplant.csv')  # raw string keeps the backslashes literal

We check the structure.

In [3]:
df.head()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


Now we describe the variables.

In [4]:
df.describe()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


Let's now do some bivariate data analysis.
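
One simple form of bivariate analysis is the Pearson correlation of each feature with PE. The following is a minimal sketch, assuming only the five rows printed by `df.head()` above as stand-in data, so the resulting values are illustrative and are not the correlations of the full dataset. The names `sample` and `corr_with_pe` are introduced here for illustration.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() above, not the full dataset.
sample = pd.DataFrame({
    'AT': [14.96, 25.18, 5.11, 20.86, 10.82],
    'V':  [41.76, 62.96, 39.4, 57.32, 37.5],
    'AP': [1024.07, 1020.04, 1012.16, 1010.24, 1009.23],
    'RH': [73.17, 59.08, 92.14, 76.64, 96.62],
    'PE': [463.26, 444.37, 488.56, 446.48, 473.9],
})

# Pairwise Pearson correlations; keep only the column for the target PE
# and drop the trivial PE-with-PE entry.
corr_with_pe = sample.corr()['PE'].drop('PE').sort_values()
print(corr_with_pe)
```

On the real data the same two lines can be applied to `df` directly. A negative value means PE tends to fall as the feature rises.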

Next we separate the independent variables (the features) from the dependent variable (the target).

In [5]:
X = df[['AT', 'V', 'AP', 'RH']]
y = df['PE']  

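Now we split the data into training and test sets and set the random seed for reproducibility. Note that `test_size` is the fraction of rows held out for testing, so `test_size=0.7` as written below trains on 30% of the data and tests on 70%; passing `test_size=0.3` instead would give the more common 70%/30% train/test split.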

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
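
To see what `test_size` does to the row counts, the split can be run on a placeholder array of the same length as the dataset. This is a sketch under that assumption: `placeholder` stands in for the real rows, and only the sizes of the resulting pieces are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9568                    # number of data points stated in the dataset description
placeholder = np.arange(n_rows)  # stand-in for the real rows

# test_size is the fraction of rows that goes to the TEST set.
train_a, test_a = train_test_split(placeholder, test_size=0.7, random_state=0)
train_b, test_b = train_test_split(placeholder, test_size=0.3, random_state=0)

print(len(train_a), len(test_a))  # 2870 6698
print(len(train_b), len(test_b))  # 6697 2871
```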

Next we train the linear regression algorithm.

In [7]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()  # normalize=False was the default; the argument was removed in scikit-learn 1.2
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
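
`LinearRegression` fits an ordinary least squares model. The idea can be sketched with NumPy alone: append a column of ones for the intercept and solve the least squares problem. The data below is synthetic, and `true_coef` and `true_intercept` are invented values for illustration, not results from the power plant dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 4 features, invented coefficients and intercept.
X_demo = rng.normal(size=(200, 4))
true_coef = np.array([-2.0, -0.3, 0.1, -0.15])
true_intercept = 450.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.01, size=200)

# Design matrix with a leading column of ones so the first weight is the intercept.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares solution of A @ beta ~= y_demo.
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept_hat, coef_hat = beta[0], beta[1:]

print(intercept_hat)
print(coef_hat)
```

The recovered values should land close to the invented ones because the added noise is small.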

Now we check the coefficients.

In [8]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df.loc['intercept'] = regressor.intercept_
coeff_df

| | Coefficient |
|---|---|
| AT | -1.893973 |
| V | -0.264006 |
| AP | 0.081952 |
| RH | -0.151476 |
| intercept | 434.005921 |
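
A fitted linear model predicts PE as the intercept plus the sum of each coefficient times its feature value. As a worked example, the sketch below applies the coefficients printed above to the first row shown by `df.head()`. The variable names are introduced here for illustration, and the printed coefficients are rounded, so the result is approximate.

```python
# Coefficients and intercept as printed in the table above (rounded values).
coefficients = {'AT': -1.893973, 'V': -0.264006, 'AP': 0.081952, 'RH': -0.151476}
intercept = 434.005921

# Feature values of the first row shown by df.head().
first_row = {'AT': 14.96, 'V': 41.76, 'AP': 1024.07, 'RH': 73.17}

# Prediction = intercept + sum(coefficient * feature value).
predicted_pe = intercept + sum(coefficients[name] * first_row[name] for name in coefficients)
print(round(predicted_pe, 2))  # roughly 467.49
```

The observed PE for that row is 463.26, so the model overshoots by about 4.2 on this observation.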


Let's now make predictions on the test set.

In [9]:
y_pred = regressor.predict(X_test)

Now we can compare actual values to the predicted values.

In [10]:
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  # separate name so the original df is not overwritten
comparison_df.head()

| | Actual | Predicted |
|---|---|---|
| 4834 | 431.23 | 431.418288 |
| 1768 | 460.01 | 458.770406 |
| 2819 | 461.14 | 462.981091 |
| 7779 | 445.9 | 448.698911 |
| 7065 | 451.29 | 457.859038 |


Let's now check some common error metrics for regression problems.

In [11]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 3.655632279825781
Mean Squared Error: 20.948736782448115
Root Mean Squared Error: 4.576979001748655
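
These three metrics are simple functions of the prediction errors: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE, which brings it back to the units of PE. The sketch below computes them by hand with NumPy, assuming only the five Actual/Predicted pairs printed earlier as input, so the numbers differ from the full test-set figures above.

```python
import numpy as np

# The five Actual/Predicted pairs printed in the comparison table (not the full test set).
actual = np.array([431.23, 460.01, 461.14, 445.9, 451.29])
predicted = np.array([431.418288, 458.770406, 462.981091, 448.698911, 457.859038])

errors = actual - predicted

mae = np.mean(np.abs(errors))  # Mean Absolute Error
mse = np.mean(errors ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)            # Root Mean Squared Error

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
```

Because MSE squares each error, the single large miss in the last pair dominates it, while MAE weights all five errors equally.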


### Dataset References

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615, http://dx.doi.org/10.1016/j.ijepes.2014.02.027. (http://www.sciencedirect.com/science/article/pii/S0142061514000908)

Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen, Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)
