#### Project Description - Predict Power Emission using Linear Regression

1. Load the dataset from the UCI ML Repository (https://archive.ics.uci.edu/ml/index.php).
2. Determine and evaluate a Baseline model
3. Build and evaluate a Linear Regression Model using SparkML

###1. Load Data

Import https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant


Data Set Information:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (PE) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.
For comparability with our baseline studies, and to allow 5x2 fold statistical tests to be carried out, we provide the data shuffled five times. For each shuffling, 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
We provide the data both in .ods and in .xlsx formats.


Attribute Information:

Features consist of hourly average ambient variables:
- Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
- Net hourly electrical energy output (PE), the target, in the range 420.26-495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.

In [3]:
# File location and type
file_location = "/FileStore/tables/powerplant.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ";"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location) \
  .na.drop()

display(df)

time,AT,V,AP,RH,PE
2017-01-01T00:00:00.000+0000,14.96,41.76,1024.07,73.17,463.26
2017-01-01T01:00:00.000+0000,25.18,62.96,1020.04,59.08,444.37
2017-01-01T02:00:00.000+0000,5.11,39.4,1012.16,92.14,488.56
2017-01-01T03:00:00.000+0000,20.86,57.32,1010.24,76.64,446.48
2017-01-01T04:00:00.000+0000,10.82,37.5,1009.23,96.62,473.9
2017-01-01T05:00:00.000+0000,26.27,59.44,1012.23,58.77,443.67
2017-01-01T06:00:00.000+0000,15.89,43.96,1014.02,75.24,467.35
2017-01-01T07:00:00.000+0000,9.48,44.71,1019.12,66.43,478.42
2017-01-01T08:00:00.000+0000,14.64,45.0,1021.78,41.25,475.98
2017-01-01T09:00:00.000+0000,11.74,43.56,1015.14,70.72,477.5


#####Examine distribution of target variable 'PE'

In [5]:
from pyspark.sql.functions import *
display(df.select('PE'))


PE
463.26
444.37
488.56
446.48
473.9
443.67
467.35
478.42
475.98
477.5


In [6]:
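from pyspark.sql.functions import mean

# show() prints the table itself and returns None, so it is not wrapped in print()
df.select(mean('PE')).show()

# approxQuantile returns a list with one value per requested probability;
# a relativeError of 0 requests the exact quantile, which can be expensive on large data
median = df.approxQuantile('PE', [0.5], 0)[0]
print(f'Median: {median}')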

###2. Baseline Model

In [8]:
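#Import dependencies
from pyspark.sql.functions import lit

#Add the baseline prediction: a hard-coded average of PE
baseline = df.withColumn("averagePE", lit(454.0))

#Impute null values in PE (a no-op here, since rows with nulls were dropped at load time)
baseline = baseline.na.fill(454.0, ['PE'])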

In [9]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the constant prediction against the actual PE values
rmseRegressionEvaluator = RegressionEvaluator(labelCol='PE', metricName='rmse', predictionCol='averagePE')
rmse = rmseRegressionEvaluator.evaluate(baseline)
r2RegressionEvaluator = RegressionEvaluator(labelCol='PE', metricName='r2', predictionCol='averagePE')
r2 = r2RegressionEvaluator.evaluate(baseline)

# Render both metrics as HTML
html = """
<body>
  <h2>Baseline Performance Metrics - RMSE and R2</h2>
  %s
</body>
""" % (f'RMSE: {rmse}     R2: {r2}')

displayHTML(html)
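
To make the two baseline metrics concrete, here is a minimal plain-Python sketch (not Spark) that computes RMSE and R2 for a constant prediction by hand. It uses only the ten PE values from the preview table shown earlier, so its numbers will differ from those on the full dataset, and the helper names `rmse_of` and `r2_of` are hypothetical. Assuming the usual definition R2 = 1 - SS_res / SS_tot, a constant equal to the sample mean gives an R2 of exactly 0, and any other constant gives a negative R2.

```python
import math

# Ten PE values (MW) copied from the preview of the dataset shown earlier
pe = [463.26, 444.37, 488.56, 446.48, 473.9, 443.67, 467.35, 478.42, 475.98, 477.5]

def rmse_of(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r2_of(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

sample_mean = sum(pe) / len(pe)

# Baseline 1: predict the sample mean for every row
mean_pred = [sample_mean] * len(pe)
# Baseline 2: predict the hard-coded 454.0 used in the notebook
const_pred = [454.0] * len(pe)

print(f"mean baseline:  RMSE={rmse_of(pe, mean_pred):.3f}  R2={r2_of(pe, mean_pred):.3f}")
print(f"454.0 baseline: RMSE={rmse_of(pe, const_pred):.3f}  R2={r2_of(pe, const_pred):.3f}")
```

On this ten-row sample the hard-coded 454.0 is further from the sample mean than it would be on the full dataset, so the gap between the two baselines is exaggerated here.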

###3. Linear Regression Model

#####Splitting Data into Train and Test

In [12]:
(trainDF, testDF) = df.randomSplit([.8, .2], seed=42)
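
`randomSplit` assigns rows to the splits at random according to the given weights, so the resulting sizes are approximately, not exactly, 80% and 20%. Fixing `seed` is meant to make the assignment repeatable for the same data and partitioning. The plain-Python sketch below imitates that idea on a list of row indices. It is an illustration under those assumptions, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Assign each row independently to train or test with a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row draws a uniform number in [0, 1); below the weight goes to train
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))          # stand-in for 1000 DataFrame rows
train, test = random_split(rows)

print(len(train), len(test))      # sizes are close to 800/200, not exactly
```

Calling `random_split(rows)` again with the same seed returns the same two lists, which is the property that makes a seeded train/test split reproducible.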

#####Build and train model

In [14]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#Vectorising features columns in preparation for feeding into Linear Regression
#Define Vector Assembler
vecAssembler = VectorAssembler(inputCols = ["AT", "V", "AP", "RH"], outputCol = "features")

#Run train/test data through the assembler
vecTrainDF = vecAssembler.transform(trainDF)
vecTestDF = vecAssembler.transform(testDF)

#Create Linear Regression Model
lr = LinearRegression(featuresCol = "features", labelCol = "PE")

# Train Model
lrModel = lr.fit(vecTrainDF)

#####Make predictions using the test data

In [16]:
predDF=lrModel.transform(vecTestDF)

In [17]:
display(predDF)

time,AT,V,AP,RH,PE,features,prediction
2017-01-01T01:00:00.000+0000,25.18,62.96,1020.04,59.08,444.37,"List(1, 4, List(), List(25.18, 62.96, 1020.04, 59.08))",444.11690697929447
2017-01-01T02:00:00.000+0000,5.11,39.4,1012.16,92.14,488.56,"List(1, 4, List(), List(5.11, 39.4, 1012.16, 92.14))",483.5504389801087
2017-01-01T03:00:00.000+0000,20.86,57.32,1010.24,76.64,446.48,"List(1, 4, List(), List(20.86, 57.32, 1010.24, 76.64))",450.5200580883712
2017-01-01T13:00:00.000+0000,25.71,58.59,1012.77,61.83,451.28,"List(1, 4, List(), List(25.71, 58.59, 1012.77, 61.83))",443.1614411346432
2017-01-01T16:00:00.000+0000,18.21,45.0,1022.86,48.84,467.54,"List(1, 4, List(), List(18.21, 45.0, 1022.86, 48.84))",463.9577816353353
2017-01-02T13:00:00.000+0000,16.38,47.45,1010.08,88.86,450.69,"List(1, 4, List(), List(16.38, 47.45, 1010.08, 88.86))",459.6878734912477
2017-01-03T04:00:00.000+0000,32.57,78.92,1011.6,66.47,430.12,"List(1, 4, List(), List(32.57, 78.92, 1011.6, 66.47))",424.0152252070016
2017-01-03T05:00:00.000+0000,8.11,42.18,1014.82,93.09,473.62,"List(1, 4, List(), List(8.11, 42.18, 1014.82, 93.09))",476.95425994176566
2017-01-03T07:00:00.000+0000,23.04,59.43,1010.23,68.99,442.99,"List(1, 4, List(), List(23.04, 59.43, 1010.23, 68.99))",446.9502678958327
2017-01-03T13:00:00.000+0000,29.01,65.71,1013.61,48.07,446.22,"List(1, 4, List(), List(29.01, 65.71, 1013.61, 48.07))",437.2668954252651
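
Each value in the `prediction` column is a linear combination of the assembled features: the model's intercept plus the dot product of its coefficients with the feature vector. The sketch below spells that out in plain Python. The coefficient and intercept values are made-up placeholders, not the fitted values from this notebook; in Spark the fitted ones are available as `lrModel.coefficients` and `lrModel.intercept`.

```python
# Hypothetical parameters, for illustration only (NOT the fitted model's values)
coefficients = [-2.0, -0.25, 0.05, -0.15]   # one weight per feature: AT, V, AP, RH
intercept = 460.0

def predict(features, coefficients, intercept):
    """Linear regression prediction: intercept + sum(w_i * x_i)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Feature vector of the first test row shown above: [AT, V, AP, RH]
row = [25.18, 62.96, 1020.04, 59.08]
print(predict(row, coefficients, intercept))
```

The order of the weights must match the `inputCols` order given to the `VectorAssembler`, since the assembled vector carries no column names.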


#####Evaluate the model

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the predictions on the test set against the actual PE values
rmseRegressionEvaluator = RegressionEvaluator(labelCol='PE', metricName='rmse', predictionCol='prediction')
rmse = rmseRegressionEvaluator.evaluate(predDF)
print(f'RMSE: {rmse}')
r2RegressionEvaluator = RegressionEvaluator(labelCol='PE', metricName='r2', predictionCol='prediction')
r2 = r2RegressionEvaluator.evaluate(predDF)
print(f'R2: {r2}')

# Render both metrics as HTML
html = """
<body>
  <h2>Linear Regression's Performance Metrics - RMSE and R2</h2>
  %s
</body>
""" % (f'RMSE: {rmse}     R2: {r2}')

displayHTML(html)

####Comparing the performance metrics shows that the Linear Regression model predicts PE considerably better than the Baseline, as expected

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>