# 3.1.4 Challenge: Model Comparison
Here let's work on regression. Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

In [1]:
import pandas as pd
import numpy as np
import scipy
import math
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import neighbors
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

%matplotlib inline

__Source:__  UCI Machine Learning - [Combined Cycle Power Plant Data Set](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant#)

__Attribute Information:__

Features consist of hourly average ambient variables:
- Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg

The target is the net hourly electrical energy output (PE), in the range 420.26 to 495.76 MW.

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.



In [2]:
df = pd.read_excel('CCPP/Folds5x2_pp.xlsx')
df.head()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


In [3]:
df.describe()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


### OLS Regression

In [4]:
#building ordinary least squares model
linear_formula = 'PE ~ AT+V+AP+RH'

lm = smf.ols(formula=linear_formula, data=df).fit()


In [5]:
lm.params

Intercept    454.609274
AT            -1.977513
V             -0.233916
AP             0.062083
RH            -0.158054
dtype: float64

In [6]:
lm.pvalues

Intercept     0.000000e+00
AT            0.000000e+00
V            4.375305e-215
AP            5.507109e-11
RH           3.104584e-293
dtype: float64

In [7]:
lm.rsquared

0.9286960898122536

In [8]:
lm.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 435.500123 | 473.718425 |
| AT | -2.007483 | -1.947543 |
| V | -0.248191 | -0.219642 |
| AP | 0.043543 | 0.080623 |
| RH | -0.166225 | -0.149883 |


In [9]:
correlation_matrix = df.corr()
print(correlation_matrix)

          AT         V        AP        RH        PE
AT  1.000000  0.844107 -0.507549 -0.542535 -0.948128
V   0.844107  1.000000 -0.413502 -0.312187 -0.869780
AP -0.507549 -0.413502  1.000000  0.099574  0.518429
RH -0.542535 -0.312187  0.099574  1.000000  0.389794
PE -0.948128 -0.869780  0.518429  0.389794  1.000000


Features V and AT are strongly correlated with one another (r = 0.84), which makes their individual coefficients harder to interpret. I will run further model iterations eliminating one or both features from the model.
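One way to quantify how much this correlation affects the coefficient estimates is the variance inflation factor (VIF). The following is a minimal sketch, assuming only the predictor correlations printed above: the VIF of each feature equals the corresponding diagonal entry of the inverse of the predictor correlation matrix. The variable names are illustrative, and the common rule of thumb that a VIF above roughly 5 signals notable collinearity is a convention rather than a hard threshold.

```python
import numpy as np

features = ["AT", "V", "AP", "RH"]

# Predictor-only block of the correlation matrix printed above (PE excluded).
corr = np.array([
    [ 1.000000,  0.844107, -0.507549, -0.542535],
    [ 0.844107,  1.000000, -0.413502, -0.312187],
    [-0.507549, -0.413502,  1.000000,  0.099574],
    [-0.542535, -0.312187,  0.099574,  1.000000],
])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the
# other features. This equals the j-th diagonal entry of the inverse
# correlation matrix.
vif = dict(zip(features, np.diag(np.linalg.inv(corr))))

for name, value in sorted(vif.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: VIF = {value:.2f}")
```

AT has the largest VIF of the four features, which is consistent with testing models that drop AT or V.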

In [10]:
#building ordinary least squares model without AT
linear_formula2 = 'PE ~ V+AP+RH'

lm2 = smf.ols(formula=linear_formula2, data=df).fit()

In [11]:
lm2.params

Intercept   -74.972282
V            -1.001427
AP            0.564457
RH            0.160676
dtype: float64

In [12]:
lm2.pvalues

Intercept     3.265612e-07
V             0.000000e+00
AP            0.000000e+00
RH           2.744155e-175
dtype: float64

In [13]:
lm2.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | -103.725956 | -46.218607 |
| V | -1.015146 | -0.987709 |
| AP | 0.53643 | 0.592485 |
| RH | 0.14975 | 0.171603 |


In [14]:
lm2.rsquared

0.8039581703127988

Removing feature AT reduced the predictive capability of the model: R-squared fell from 0.929 to 0.804. What if we try to remove feature V instead?

In [15]:
#building ordinary least squares model without V
linear_formula3 = 'PE ~ AT+AP+RH'

lm3 = smf.ols(formula=linear_formula3, data=df).fit()

In [16]:
lm3.params

Intercept    490.323746
AT            -2.377708
AP             0.025372
RH            -0.203832
dtype: float64

In [17]:
lm3.pvalues

Intercept    0.000000
AT           0.000000
AP           0.010255
RH           0.000000
dtype: float64

In [18]:
lm3.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 470.342367 | 510.305126 |
| AT | -2.395992 | -2.359424 |
| AP | 0.006002 | 0.044743 |
| RH | -0.211913 | -0.19575 |


In [19]:
lm3.rsquared

0.9210025307676984

Removing feature V maintained most of the predictive ability of the model with an R-squared value of 0.921.  

In [20]:
#building OLS without V and AT
linear_formula4 = 'PE ~ AP+RH'

lm4 = smf.ols(formula=linear_formula4, data=df).fit()

In [21]:
lm4.params

Intercept   -985.494851
AP             1.392132
RH             0.399265
dtype: float64

In [22]:
lm4.pvalues

Intercept    0.0
AP           0.0
RH           0.0
dtype: float64

In [23]:
lm4.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | -1031.406006 | -939.583696 |
| AP | 1.346709 | 1.437555 |
| RH | 0.380789 | 0.417741 |


In [24]:
lm4.rsquared

0.3842741215943508

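Removing both AT and V cost the model most of its predictive ability: R-squared dropped to 0.384. Linear Model 3 offers the best trade-off for predicting the plant's energy output. It drops the collinear feature V while keeping R-squared at 0.921, close to the full model's 0.929.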
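Since the four models use different numbers of predictors, adjusted R-squared is a fairer basis for comparison than plain R-squared. The sketch below applies the standard formula to the R-squared values reported above, taking the 9,568 rows from `df.describe()` as the sample size. The helper name `adjusted_r2` and the dictionary labels are illustrative. A fitted statsmodels result also exposes this quantity as `rsquared_adj`.

```python
# R-squared values reported above, paired with each model's predictor count.
n_obs = 9568  # row count from df.describe()
models = {
    "lm  (AT+V+AP+RH)": (0.9286960898122536, 4),
    "lm2 (V+AP+RH)":    (0.8039581703127988, 3),
    "lm3 (AT+AP+RH)":   (0.9210025307676984, 3),
    "lm4 (AP+RH)":      (0.3842741215943508, 2),
}


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


adjusted = {name: adjusted_r2(r2, n_obs, p) for name, (r2, p) in models.items()}

for name, value in adjusted.items():
    print(f"{name}: adjusted R^2 = {value:.4f}")
```

With this many rows the penalty for an extra predictor is tiny, so the ranking matches plain R-squared: the four-feature model still scores highest. Preferring the three-feature model without V is therefore a choice in favor of a simpler, less collinear model rather than a gain in fit.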

### KNN Regression

In [25]:
#building KNN regression with k=5 without weighting
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
X = df[['AT', 'V', 'AP', 'RH']]
Y = df['PE']
knn.fit(X, Y)

T = pd.DataFrame(np.arange(0, 100, 0.1)[:, np.newaxis])
T[1] = T[0]
T[2] = T[0]
T[3] = T[0]

Y_ = knn.predict(T)

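# cross_val_score uses the regressor's default scorer, so the values printed
# below are R-squared scores per fold, not classification accuracy
score = cross_val_score(knn, X, Y, cv=5)
print('Unweighted Accuracy: %0.2f (+/- %0.2f)' % (score.mean(), score.std()))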

Unweighted Accuracy: 0.95 (+/- 0.00)


In [26]:
#building KNN regression with normalized data and k=5 without weighting
knn_n = neighbors.KNeighborsRegressor(n_neighbors=5)
X = df[['AT', 'V', 'AP', 'RH']]
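# note: preprocessing.normalize rescales each row (sample) to unit L2 norm;
# it does not put the feature columns on a common scale
normalized_X = preprocessing.normalize(X)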
Y = df['PE']
knn_n.fit(normalized_X, Y)

T = pd.DataFrame(np.arange(0, 1, 0.01)[:, np.newaxis])
T[1] = T[0]
T[2] = T[0]
T[3] = T[0]

Y_ = knn_n.predict(T)

score = cross_val_score(knn_n, normalized_X, Y, cv=5)
print('Unweighted Normalized Accuracy: %0.2f (+/- %0.2f)' % (score.mean(), score.std()))

Unweighted Normalized Accuracy: 0.94 (+/- 0.00)
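It helps to check what `preprocessing.normalize` does to this kind of data. The sketch below applies it to the five rows shown by `df.head()` and compares it with `preprocessing.StandardScaler`, which is offered here as one possible alternative rather than the method used in this notebook. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# First five feature rows from df.head() above; columns are AT, V, AP, RH.
X_head = np.array([
    [14.96, 41.76, 1024.07, 73.17],
    [25.18, 62.96, 1020.04, 59.08],
    [ 5.11, 39.40, 1012.16, 92.14],
    [20.86, 57.32, 1010.24, 76.64],
    [10.82, 37.50, 1009.23, 96.62],
])

# normalize() works row by row: each sample is divided by its own L2 norm.
row_normalized = preprocessing.normalize(X_head)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler works column by column: each feature is shifted to mean 0
# and scaled to unit variance.
col_standardized = preprocessing.StandardScaler().fit_transform(X_head)

print("Row norms after normalize():", np.round(row_norms, 6))
print("AP column after normalize():", np.round(row_normalized[:, 2], 4))
print("Column std after StandardScaler:", np.round(col_standardized.std(axis=0), 6))
```

Because AP is close to 1000 in every row, each row's norm is also close to 1000, so row normalization divides all features by nearly the same constant. The relative distances between samples barely change, which would explain why the normalized and unnormalized KNN scores are so similar.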


In [27]:
#repeat model with distance weights
knn_w = neighbors.KNeighborsRegressor(n_neighbors=5, weights='distance')
X = df[['AT', 'V', 'AP', 'RH']]
normalized_X = preprocessing.normalize(X)
Y = df['PE']
knn_w.fit(normalized_X, Y)

T = pd.DataFrame(np.arange(0, 1, 0.01)[:, np.newaxis])
T[1] = T[0]
T[2] = T[0]
T[3] = T[0]

Y_w = knn_w.predict(T)

score = cross_val_score(knn_w, normalized_X, Y, cv=5)
print('Weighted Normalized Accuracy: %0.2f (+/- %0.2f)' % (score.mean(), score.std()))

Weighted Normalized Accuracy: 0.94 (+/- 0.00)


I'm concerned that the weighted and unweighted models on normalized data produced the same score, and that none of the scores had a standard deviation above 0.00 when rounded to two decimals.
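One way to probe this concern is to report more decimals, add an error metric in the target's units, and standardize the features inside a pipeline so the scaler is refit on each training fold. The sketch below does this on synthetic data that only mimics the CCPP feature ranges (an assumption, since the real file is not loaded here). The coefficients used to generate `y_demo` are made up for illustration. Note that an integer `cv` does not shuffle the rows for a regressor, so an explicit shuffled `KFold` is used.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CCPP data (hypothetical values, not the real set).
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.uniform(2, 37, n),       # AT-like
    rng.uniform(25, 82, n),      # V-like
    rng.uniform(993, 1033, n),   # AP-like
    rng.uniform(25, 100, n),     # RH-like
])
y_demo = 500 - 2.0 * X_demo[:, 0] - 0.2 * X_demo[:, 3] + rng.normal(0, 4, n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for weights in ("uniform", "distance"):
    # The scaler lives inside the pipeline, so it never sees the held-out fold.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsRegressor(n_neighbors=5, weights=weights),
    )
    r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    rmse = -cross_val_score(model, X_demo, y_demo, cv=cv,
                            scoring="neg_root_mean_squared_error")
    results[weights] = (r2, rmse)
    print(f"{weights:>8}: R^2 = {r2.mean():.4f} (+/- {r2.std():.4f}), "
          f"RMSE = {rmse.mean():.3f} (+/- {rmse.std():.3f})")
```

On a data set with thousands of rows, fold-to-fold variation in R-squared can easily be smaller than 0.005, so a printed standard deviation of 0.00 is not by itself a sign that something is wrong.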

#### Summary
Because I do not fully trust the KNN regression results, I would say that the OLS model better describes the data's behavior. It provides clear context for how a change in each variable affects the power output of the plant. Linear model 3 modeled the data well (R-squared of 0.921) without the collinear feature V. This model suggests that hourly electrical energy output is related to ambient temperature, pressure, and relative humidity. Higher temperatures and higher relative humidity were associated with lower energy output, whereas higher pressure was associated with higher energy output. These relationships show how strongly ambient conditions, temperature in particular, affect the plant's electrical output.

The features are measured in different units and have different spreads (V has a standard deviation of 12.7, while AP has 5.9), so KNN distances weight the features unequally, and the normalization applied here did not put them on a common scale. For this reason, the OLS model was better able to identify the contribution of each feature to the target.
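To address whether the two models miss in different ways, their out-of-fold residuals can be compared directly. The following is a sketch on synthetic data (an assumption: two CCPP-like features with a mild curvature in temperature, invented here for illustration), not a result from the real data set. It reports each model's RMSE, the mean residual per temperature quartile, and the correlation between the two residual vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a slightly nonlinear temperature effect (hypothetical).
rng = np.random.default_rng(42)
n = 1000
temp = rng.uniform(2, 37, n)
humidity = rng.uniform(25, 100, n)
X_demo = np.column_stack([temp, humidity])
y_demo = (500 - 2.0 * temp + 0.02 * (temp - 20) ** 2
          - 0.15 * humidity + rng.normal(0, 3, n))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "OLS": LinearRegression(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

# Out-of-fold residuals: each point is predicted by a model that never saw it.
residuals = {
    name: y_demo - cross_val_predict(model, X_demo, y_demo, cv=cv)
    for name, model in models.items()
}

# Assign each sample to a temperature quartile (labels 0 to 3).
quartile = np.digitize(temp, np.quantile(temp, [0.25, 0.5, 0.75]))

for name, res in residuals.items():
    rmse = np.sqrt(np.mean(res ** 2))
    by_quartile = [res[quartile == q].mean() for q in range(4)]
    print(f"{name}: RMSE = {rmse:.2f}, "
          f"mean residual by AT quartile = {np.round(by_quartile, 2)}")

# A correlation well below 1 suggests the models err on different samples.
residual_corr = np.corrcoef(residuals["OLS"], residuals["KNN"])[0, 1]
print(f"Residual correlation (OLS vs KNN): {residual_corr:.2f}")
```

A mean residual that changes sign across quartiles points to structure the model has not captured, while residuals that hover near zero in every quartile suggest the misses are mostly noise.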