by *canaytore*

# Regression

Source: Combined Cycle Power Plant dataset -> https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

We're going to use this data to predict how much power a power plant generates, based on four measured factors. The dataset has five columns:
- The first column, AT, is the ambient temperature.
- The second column, V, is the vacuum.
- The third column, AP, is the ambient pressure.
- The fourth column, RH, is the relative humidity.
- The fifth and final column, PE, measures how much power was generated. This is the value we're trying to predict.

## Linear regression

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('regression').getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

In [3]:
pp_df = spark.read.csv("/power_plant.csv", 
                       header=True, 
                       inferSchema=True)

In [4]:
pp_df

DataFrame[AT: double, V: double, AP: double, RH: double, PE: double]

In [5]:
vectorAssembler = VectorAssembler(inputCols=["AT","V","AP","RH"], outputCol="features")

In [6]:
vpp_df = vectorAssembler.transform(pp_df)

In [7]:
vpp_df.show(1)

+-----+-----+-------+-----+------+--------------------+
|   AT|    V|     AP|   RH|    PE|            features|
+-----+-----+-------+-----+------+--------------------+
|14.96|41.76|1024.07|73.17|463.26|[14.96,41.76,1024...|
+-----+-----+-------+-----+------+--------------------+
only showing top 1 row



In [8]:
lr = LinearRegression(featuresCol="features", labelCol="PE")  # PE is the label we're trying to predict

In [9]:
lr_model = lr.fit(vpp_df)

Let's take a look at some properties of the fitted model. A linear model has coefficients: here, a vector of four numbers, one for each input variable (AT, V, AP, RH, in the order we gave them to the assembler). The other important part of a linear model is the intercept, which is the value the model predicts when every input is zero. With a single input, that is the point where the fitted line crosses the y-axis. With four inputs we have fit a hyperplane rather than a line, but the idea is the same.

In [10]:
lr_model.coefficients

DenseVector([-1.9775, -0.2339, 0.0621, -0.1581])

In [11]:
lr_model.intercept

454.6092744523414
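
To make the coefficients and intercept concrete, here is a minimal sketch in plain Python that recomputes the model's prediction for the first row shown earlier. It uses the rounded coefficient values printed above, so the result is only approximate, and the variable names are ours rather than part of the Spark API.

```python
# Coefficients as printed above (rounded to four decimals), in assembler order.
coefficients = {"AT": -1.9775, "V": -0.2339, "AP": 0.0621, "RH": -0.1581}
intercept = 454.6092744523414

# First row of the dataset, as displayed by vpp_df.show(1).
row = {"AT": 14.96, "V": 41.76, "AP": 1024.07, "RH": 73.17, "PE": 463.26}

# A linear model predicts: intercept + sum(coefficient * feature value).
predicted_pe = intercept + sum(coefficients[name] * row[name] for name in coefficients)

# Positive residual means the model overshoots for this row.
residual = predicted_pe - row["PE"]

print(f"predicted PE: {predicted_pe:.2f}")
print(f"actual PE:    {row['PE']:.2f}")
print(f"error:        {residual:.2f}")
```

For this row the prediction comes out at roughly 467.3 against an observed 463.26, an overshoot of about 4, which is in line with the RMSE reported below.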

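One of the important measures of the quality of a regression model is its error, and there are different ways of measuring it. We're going to use the root mean squared error (RMSE), which summarizes how far our predictions are from the actual values, in the same units as PE. Note that `lr_model.summary` reports this error on the data the model was trained on.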

In [12]:
lr_model.summary.rootMeanSquaredError

4.557126016749488

But why are we squaring the error? When we make a prediction, sometimes we overshoot, in which case the error is positive, and sometimes we underestimate, in which case it is negative. If we simply add up positive and negative errors, they tend to cancel each other out. So the first thing we do is square each error, which gives us a non-negative value whether the error was positive or negative. We then average the squared errors and take the square root of that average, which gives us a good measure of error.
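
The same idea can be sketched in a few lines of plain Python. The numbers below are made up purely for illustration, and `root_mean_squared_error` is our own helper name, not a Spark function.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: one undershoot, one exact hit, one overshoot.
actual = [5.0, 5.0, 5.0]
predicted = [3.0, 5.0, 7.0]

# Signed errors cancel out, so their mean hides the mistakes entirely.
raw_errors = [p - a for a, p in zip(actual, predicted)]
mean_error = sum(raw_errors) / len(raw_errors)

# Squared errors cannot cancel, so the RMSE stays above zero.
example_rmse = root_mean_squared_error(actual, predicted)

print(mean_error)
print(example_rmse)
```

Here the mean of the signed errors is zero even though two of the three predictions are wrong, while the RMSE is clearly positive.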

In [13]:
lr_model.save("lr.model")  # save the model so we can load it again later
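
A saved model can later be read back with `LinearRegressionModel.load`. The sketch below assumes a Spark session is already running and that `lr.model` was written by the cell above. The helper name `reload_linear_model` is hypothetical, and the import is placed inside the function only so that the definition itself does not require PySpark.

```python
def reload_linear_model(path="lr.model"):
    """Load a previously saved linear regression model from `path`.

    Assumes an active SparkSession and that `path` was produced by
    `lr_model.save(path)`.
    """
    # Imported lazily so that defining this helper does not need PySpark.
    from pyspark.ml.regression import LinearRegressionModel

    return LinearRegressionModel.load(path)


# Usage inside the notebook (not executed here):
# reloaded = reload_linear_model()
# reloaded.coefficients, reloaded.intercept
```

Note that `save` raises an error if the path already exists. `lr_model.write().overwrite().save("lr.model")` replaces an existing copy instead.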

## Decision tree regression

In [14]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler

Now we're going to consider decision tree regression. Regression is a lot like classification in the sense that there are a number of different algorithms we can use, and it often helps to experiment with several of them to see which works best with your dataset.
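
To give a feel for what a regression tree does, here is a toy sketch of a tree with a single split (a "stump") in plain Python. It is not how Spark's `DecisionTreeRegressor` is implemented. The data is made up, and `fit_stump` and `predict_stump` are hypothetical helper names. The idea it illustrates is that a tree picks a threshold on a feature and predicts the mean label on each side.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Predict the mean label of the side of the threshold that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Made-up data: output drops as temperature rises.
temperatures = [10.0, 12.0, 14.0, 26.0, 28.0, 30.0]
outputs = [470.0, 468.0, 466.0, 440.0, 438.0, 436.0]

stump = fit_stump(temperatures, outputs)
print(stump)
print(predict_stump(stump, 11.0), predict_stump(stump, 29.0))
```

A full decision tree repeats this splitting step recursively on each side, over all features, until a stopping rule such as a maximum depth is reached.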

In [15]:
vectorAssembler = VectorAssembler(inputCols=["AT","V","AP","RH"], outputCol="features")

In [16]:
vpp_df = vectorAssembler.transform(pp_df)

In [17]:
vpp_df.show(1)

+-----+-----+-------+-----+------+--------------------+
|   AT|    V|     AP|   RH|    PE|            features|
+-----+-----+-------+-----+------+--------------------+
|14.96|41.76|1024.07|73.17|463.26|[14.96,41.76,1024...|
+-----+-----+-------+-----+------+--------------------+
only showing top 1 row



In [18]:
splits = vpp_df.randomSplit([0.7, 0.3])  # roughly 70% training, 30% test; pass a seed for a reproducible split

In [19]:
train_df = splits[0]
test_df = splits[1]

In [20]:
train_df.count()

6676

In [21]:
test_df.count()

2892

In [22]:
vpp_df.count()

9568
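
The counts above show that the split is only approximately 70/30: `randomSplit` assigns rows at random according to the weights, so the exact sizes vary from run to run unless a seed is fixed. A quick check in plain Python, using the counts printed above:

```python
# Row counts reported by the three cells above.
train_count, test_count, total_count = 6676, 2892, 9568

# Every row ends up in exactly one of the two splits.
rows_accounted_for = (train_count + test_count == total_count)

# Actual share of rows in the training set (the requested weight was 0.7).
train_fraction = train_count / total_count

print(rows_accounted_for)
print(f"{train_fraction:.3f}")
```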

In [23]:
dt = DecisionTreeRegressor(featuresCol="features", labelCol="PE")

In [24]:
dt_model = dt.fit(train_df)

In [25]:
dt_predictions = dt_model.transform(test_df)

In [26]:
dt_evaluator = RegressionEvaluator(labelCol="PE", predictionCol="prediction", metricName="rmse")

In [27]:
rmse = dt_evaluator.evaluate(dt_predictions)
rmse

4.505463629189437

## Gradient-boosted tree regression

In [28]:
from pyspark.ml.regression import GBTRegressor

In [29]:
gbt = GBTRegressor(featuresCol="features", labelCol="PE")

In [30]:
gbt_model = gbt.fit(train_df)

In [31]:
gbt_predictions = gbt_model.transform(test_df)

In [32]:
gbt_evaluator = RegressionEvaluator(labelCol="PE", predictionCol="prediction", metricName="rmse")

In [33]:
gbt_rmse = gbt_evaluator.evaluate(gbt_predictions)
gbt_rmse

4.055289723264364

In [34]:
rmse  # RMSE of the decision tree regressor, for comparison

4.505463629189437
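
To put the three results side by side, here is a small sketch in plain Python using the RMSE values printed in this notebook. One caveat worth labeling: the linear regression figure came from `lr_model.summary`, which is measured on the training data, while the two tree-based figures were measured on the held-out test split, so the comparison is not strictly like for like.

```python
# RMSE values copied from the outputs above.
rmse_by_model = {
    "linear regression (training data)": 4.557126016749488,
    "decision tree (test split)": 4.505463629189437,
    "gradient-boosted trees (test split)": 4.055289723264364,
}

# Lower RMSE is better.
best_model = min(rmse_by_model, key=rmse_by_model.get)

# Relative reduction in RMSE from the decision tree to gradient-boosted trees.
dt_rmse_value = rmse_by_model["decision tree (test split)"]
gbt_rmse_value = rmse_by_model["gradient-boosted trees (test split)"]
improvement = (dt_rmse_value - gbt_rmse_value) / dt_rmse_value

print(best_model)
print(f"{improvement:.1%}")
```

On these numbers, gradient-boosted trees reduce the test RMSE by about 10% relative to the single decision tree.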

*Notes:*
- Regression algorithms are designed to predict numeric values.
- Gradient-boosted tree regression can sometimes give the best performance, but its models may take longer to build. As with classification, it helps to experiment with different regression algorithms on your dataset. It also helps to vary the hyperparameters to see whether you can improve performance further.
- In general, it's best to start with linear regression. In real-world examples, linear regression frequently gives usable, high-quality results. If your dataset doesn't work well with linear regression, try decision tree regression, which might give you better results. If you need the best-performing model possible and you're willing to spend extra time building it, take a look at gradient-boosted tree regression.
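
As one possible way to vary hyperparameters, the sketch below runs a small grid search with cross-validation over the gradient-boosted tree regressor, using PySpark's `ParamGridBuilder` and `CrossValidator`. The helper name `tune_gbt`, the grid values, and the number of folds are our assumptions, chosen only for illustration. It expects the `train_df` and `test_df` DataFrames built earlier, and the imports sit inside the function only so that the definition itself does not require PySpark.

```python
def tune_gbt(train_df, test_df):
    """Grid-search a GBTRegressor and return (best_model, test_rmse).

    Assumes both DataFrames have a "features" vector column and a "PE" label.
    """
    # Imported lazily so that defining this helper does not need PySpark.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import GBTRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    gbt = GBTRegressor(featuresCol="features", labelCol="PE")
    evaluator = RegressionEvaluator(
        labelCol="PE", predictionCol="prediction", metricName="rmse")

    # Illustrative grid: 2 tree depths x 2 iteration counts = 4 candidates.
    grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5])
            .addGrid(gbt.maxIter, [10, 20])
            .build())

    # 3-fold cross-validation on the training data picks the best candidate.
    cv = CrossValidator(estimator=gbt, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    cv_model = cv.fit(train_df)

    # Score the winning model on the held-out test split.
    test_rmse = evaluator.evaluate(cv_model.transform(test_df))
    return cv_model.bestModel, test_rmse


# Usage inside the notebook (not executed here):
# best_gbt, best_rmse = tune_gbt(train_df, test_df)
```

Each grid point is trained once per fold, so even this small grid fits the model a dozen times. That is the extra build time the notes above refer to.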