APPLYING SUPPORT VECTOR MACHINE (REGRESSION) - MACHINE LEARNING ALGORITHM FROM SCRATCH WITH REAL DATASETS

In [None]:
1. Understanding the datasets

In [None]:
Data Set Information:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from, and has an effect on, the steam turbine, the other three ambient variables affect the GT performance.
For comparability with our baseline studies, and to allow 5x2 fold statistical tests to be carried out, we provide the data shuffled five times. For each shuffling, 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
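
The 5x2 scheme described above (five shuffles, each followed by 2-fold cross-validation) can be sketched with scikit-learn's RepeatedKFold. This is a minimal sketch on synthetic stand-in data, not the dataset authors' own procedure; the data, the SVR settings and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): swap in the real features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.05, size=200)

# 5 repetitions of 2-fold CV; each repetition reshuffles the rows first.
cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)
scores = cross_val_score(SVR(kernel="rbf", C=10.0), X, y, cv=cv, scoring="r2")

# One R2 score per fold per repetition, i.e. 10 measurements in total.
print(len(scores))
print(scores.mean(), scores.std())
```

The ten scores are the "10 measurements" the description refers to; a statistical test would then be run on them.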

In [None]:
Attribute Information:

Features consist of hourly average ambient variables:
- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (EP) in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.
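
Because the variables are given without normalization and sit on very different scales (tens of degrees versus roughly a thousand millibar), an RBF-kernel SVR is often trained on standardized inputs. The following is a hedged sketch of one way to do that with a scikit-learn pipeline. The data is synthetic, drawn only to mimic the T and AP ranges listed above, and the target formula is invented purely for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption) using the T and AP ranges from the attribute list.
T = rng.uniform(1.81, 37.11, size=300)
AP = rng.uniform(992.89, 1033.30, size=300)
X = np.column_stack([T, AP])
# Invented target relation, for illustration only.
y = 500.0 - 2.0 * T + 0.05 * (AP - 1013.0) + rng.normal(0.0, 1.0, size=300)

# The scaler is fitted on the training data and applied before the SVR sees it.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
model.fit(X, y)

# After scaling, every feature has zero mean and unit variance.
X_scaled = model.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time.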

In [None]:
2. Importing Datasets

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("ccpp.csv")
print(df)

         X1     X2       X3     X4       Y
0      8.34  40.77  1010.84  90.01  480.48
1     23.64  58.49  1011.40  74.20  445.75
2     29.74  56.90  1007.15  41.91  438.76
3     19.07  49.69  1007.22  76.79  453.09
4     11.80  40.66  1017.13  97.20  464.43
5     13.97  39.16  1016.05  84.60  470.96
6     22.10  71.29  1008.20  75.38  442.35
7     14.47  41.76  1021.98  78.41  464.00
8     31.25  69.51  1010.25  36.83  428.77
9      6.77  38.18  1017.80  81.13  484.31
10    28.28  68.67  1006.36  69.90  435.29
11    22.99  46.93  1014.15  49.42  451.41
12    29.30  70.04  1010.95  61.23  426.25
13     8.14  37.49  1009.04  80.33  480.66
14    16.92  44.60  1017.34  58.75  460.17
15    22.72  64.15  1021.14  60.34  453.13
16    18.14  43.56  1012.83  47.10  461.71
17    11.49  44.63  1020.44  86.04  471.08
18     9.94  40.46  1018.90  68.51  473.74
19    23.54  41.10  1002.05  38.05  448.56
20    14.90  52.05  1015.11  77.33  464.82
21    33.80  64.96  1004.88  49.37  427.28
22    25.37
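
The file uses generic column names (X1 to X4 and Y) rather than the variable names from the attribute list. The sketch below renames them for readability. The mapping is an assumption inferred from the value ranges (for example, X3 is close to 1010, which matches Ambient Pressure), not something stated in the file itself, and only the first three printed rows are reproduced.

```python
import pandas as pd

# First three rows as printed above; the rest of the file is not reproduced here.
df_head = pd.DataFrame(
    [
        [8.34, 40.77, 1010.84, 90.01, 480.48],
        [23.64, 58.49, 1011.40, 74.20, 445.75],
        [29.74, 56.90, 1007.15, 41.91, 438.76],
    ],
    columns=["X1", "X2", "X3", "X4", "Y"],
)

# Assumed mapping, inferred from the ranges in the attribute list.
column_names = {"X1": "T", "X2": "V", "X3": "AP", "X4": "RH", "Y": "EP"}
named = df_head.rename(columns=column_names)
print(named)
```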

In [None]:
3. Splitting datasets to train 

In [16]:
# NOTE: Only two columns and the first 500 rows are used here. Adding more columns or rows increases computation time.
# To train on the full dataset, drop the [:500] slice: X_train = df[['X1', 'X2', 'X3', 'X4']].values.reshape(9568, 4)

X_train = df[['X1', 'X2' ]][:500].values.reshape(500, 2)
y_train = df[['Y']][:500].values.reshape(500, 1)


print("\n Inputs ")
print(X_train)
print("\n Output ")
print(y_train)


 Inputs 
[[ 8.34 40.77]
 [23.64 58.49]
 [29.74 56.9 ]
 [19.07 49.69]
 [11.8  40.66]
 [13.97 39.16]
 [22.1  71.29]
 [14.47 41.76]
 [31.25 69.51]
 [ 6.77 38.18]
 [28.28 68.67]
 [22.99 46.93]
 [29.3  70.04]
 [ 8.14 37.49]
 [16.92 44.6 ]
 [22.72 64.15]
 [18.14 43.56]
 [11.49 44.63]
 [ 9.94 40.46]
 [23.54 41.1 ]
 [14.9  52.05]
 [33.8  64.96]
 [25.37 68.31]
 [ 7.29 41.04]
 [13.55 40.71]
 [ 6.39 35.57]
 [26.64 62.44]
 [ 7.84 41.39]
 [21.82 58.66]
 [27.17 67.45]
 [13.42 41.23]
 [20.77 56.85]
 [ 8.29 36.08]
 [30.98 67.45]
 [31.96 71.29]
 [15.83 52.75]
 [22.56 70.79]
 [25.91 75.6 ]
 [ 8.24 39.61]
 [24.66 60.29]
 [29.31 68.67]
 [21.48 66.91]
 [18.28 44.71]
 [26.96 65.34]
 [16.01 65.46]
 [27.37 63.73]
 [16.3  39.63]
 [23.8  48.6 ]
 [ 8.19 41.66]
 [25.28 67.69]
 [21.47 70.32]
 [30.54 67.45]
 [18.3  44.06]
 [25.82 72.39]
 [31.12 69.13]
 [15.99 39.63]
 [ 8.42 42.28]
 [23.7  69.23]
 [15.71 40.12]
 [29.11 74.9 ]
 [23.73 61.02]
 [28.26 65.34]
 [15.92 41.2 ]
 [33.4  70.8 ]
 [31.92 68.3 ]
 [26.87 74.99]


In [None]:
4. Implementing SVM Regression using scikit-learn

In [4]:
from sklearn import svm
svmm = svm.SVR(kernel='rbf', C=1000)

In [None]:
5. Fitting data

In [5]:
svmm.fit(X_train, y_train.ravel())

SVR(C=1000, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
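
The printed parameters include epsilon=0.1. In support vector regression, residuals smaller than epsilon incur no penalty, and larger ones are penalized linearly beyond that margin. The sketch below illustrates this epsilon-insensitive loss in NumPy. It is a simplified illustration of the loss term only, not scikit-learn's internal implementation; the function name and the predicted values are hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = np.abs(y_true - y_pred)
    # Residuals inside the epsilon tube cost nothing.
    return np.maximum(0.0, residual - epsilon)

# Targets are the first three Y values shown earlier; predictions are hypothetical.
y_true = [480.48, 445.75, 438.76]
y_hyp = [480.53, 445.75, 439.76]
losses = epsilon_insensitive_loss(y_true, y_hyp)
print(losses)
```

The first two residuals fall inside the tube and contribute zero loss; only the third, which misses by about 1 MW, is penalized.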

In [None]:
6. Testing - Predict a sample value

In [6]:
print(svmm.predict([[23.68,51.30]]))

[451.96979375]


In [None]:
7. Predicting the training data

In [8]:
# This is done for evaluation purposes
y_pred = svmm.predict(X_train)
print(y_pred)

[480.58045765 445.8500291  438.85998614 452.99015145 480.49053654
 471.05992589 442.25020217 463.89963444 428.6703736  484.21042174
 435.19045062 451.3096627  429.92707203 480.56044255 460.27015628
 453.02983375 461.61034289 470.98030562 473.64009332 448.65983157
 464.7196716  427.37975517 441.66003687 474.81017073 467.10969952
 487.59011857 438.76974114 485.7597861  450.40520011 429.77010874
 467.42171499 442.94976647 483.16024716 433.48977308 432.94007168
 458.49985453 435.23963549 443.30034331 477.99995947 445.15976374
 435.67000818 447.52012554 462.8218844  441.70996949 454.05990679
 437.13999188 464.00984469 440.98954326 485.10018038 445.44016424
 440.0999761  431.45008469 456.91737868 433.08040751 429.51000538
 465.04989563 481.80982108 437.25033013 462.50043773 432.52997947
 442.32024732 441.12989427 461.14261322 432.56000338 430.17035494
 437.88025235 482.14002874 462.09040156 457.61018915 438.41983891
 434.24996264 427.15026583 448.58964953 474.74261394 479.38006561
 446.74979

In [None]:
8. Evaluating the model - metrics.r2_score

In [12]:
from sklearn import metrics
print(metrics.r2_score(y_train, y_pred))

0.9841354980827051


In [None]:
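An R2 score above 0.70 is often treated as a good fit. The score above is computed on the same 500 rows used for training, so it is likely an optimistic estimate of performance on unseen data.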
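
To make the metric concrete, R2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean of the targets. The sketch below checks a manual version against r2_score. The helper name r2_manual and the predicted values are hypothetical; the targets are the first five Y values shown earlier.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = np.array([480.48, 445.75, 438.76, 453.09, 464.43])
y_hat = np.array([479.0, 447.0, 440.0, 452.0, 466.0])  # hypothetical predictions

print(r2_manual(y_true, y_hat))
print(r2_score(y_true, y_hat))
```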

In [None]:
9. Evaluating the model - Square root of r2_score

In [14]:
print(np.sqrt(metrics.r2_score(y_train,y_pred)))

0.9920360366855153
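
The square root of R2 is unitless and does not express the error in MW. If an error in the units of the target is wanted, the root mean squared error (RMSE) is a common choice. The sketch below is offered as a possible complement, not as what the cell above computes. The predicted values are hypothetical; the targets are the first five Y values shown earlier.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([480.48, 445.75, 438.76, 453.09, 464.43])
y_hat = np.array([479.0, 447.0, 440.0, 452.0, 466.0])  # hypothetical predictions

# RMSE: square root of the mean squared residual, in the same units as Y (MW).
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(rmse)
```

In the notebook, y_train.ravel() and y_pred would take the place of the two arrays.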
