# USA Housing Predictions

### Start a simple Spark Session

In [2]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@70e6b3f6


### Set the log level to ERROR to see fewer warnings

In [3]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read in the USA Housing CSV file

In [4]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("USA_Housing.csv")

data: org.apache.spark.sql.DataFrame = [Avg Area Income: double, Avg Area House Age: double ... 4 more fields]


### Check out the Schema

In [5]:
data.printSchema

root
 |-- Avg Area Income: double (nullable = true)
 |-- Avg Area House Age: double (nullable = true)
 |-- Avg Area Number of Rooms: double (nullable = true)
 |-- Avg Area Number of Bedrooms: double (nullable = true)
 |-- Area Population: double (nullable = true)
 |-- Price: double (nullable = true)



### Show the first three rows

In [6]:
data.show(3)

+------------------+------------------+------------------------+---------------------------+------------------+------------------+
|   Avg Area Income|Avg Area House Age|Avg Area Number of Rooms|Avg Area Number of Bedrooms|   Area Population|             Price|
+------------------+------------------+------------------------+---------------------------+------------------+------------------+
| 79545.45857431678| 5.682861321615587|       7.009188142792237|                       4.09|23086.800502686456|1059033.5578701235|
| 79248.64245482568|6.0028998082752425|       6.730821019094919|                       3.09| 40173.07217364482|  1505890.91484695|
|61287.067178656784| 5.865889840310001|       8.512727430375099|                       5.13| 36882.15939970458|1058987.9878760849|
+------------------+------------------+------------------------+---------------------------+------------------+------------------+
only showing top 3 rows



### See an example of what the data looks like by printing out a Row

In [7]:
val colnames = data.columns

colnames: Array[String] = Array(Avg Area Income, Avg Area House Age, Avg Area Number of Rooms, Avg Area Number of Bedrooms, Area Population, Price)


In [8]:
val firstRow = data.head(1)(0)

firstRow: org.apache.spark.sql.Row = [79545.45857431678,5.682861321615587,7.009188142792237,4.09,23086.800502686456,1059033.5578701235]


In [9]:
println("Example Data Row")
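// Start at index 0 so the first column (Avg Area Income) is printed too
for(ind <- Range(0,colnames.length)){
  println(colnames(ind))
  println(firstRow(ind))
  println("\n")
}

Example Data Row
Avg Area Income
79545.45857431678


Avg Area House Age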
5.682861321615587


Avg Area Number of Rooms
7.009188142792237


Avg Area Number of Bedrooms
4.09


Area Population
23086.800502686456


Price
1059033.5578701235




## Setting Up DataFrame for Machine Learning

Before Spark ML can train on the data, the DataFrame needs to be in the form of two columns:
**("label", "features")**

The `VectorAssembler` imported below lets us join multiple feature columns into a single column holding a vector of feature values.

### Imports for Linear Regression ML

In [10]:
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder,TrainValidationSplit}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Rename the Price column to label to follow the naming convention, and keep only the numerical columns

In [11]:
val df = data.select(data("Price").as("label"), $"Avg Area Income", $"Avg Area House Age", $"Avg Area Number of Bedrooms",
                     $"Avg Area Number of Rooms",$"Area Population")

df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: double ... 4 more fields]


In [12]:
df.show(3)

+------------------+------------------+------------------+---------------------------+------------------------+------------------+
|             label|   Avg Area Income|Avg Area House Age|Avg Area Number of Bedrooms|Avg Area Number of Rooms|   Area Population|
+------------------+------------------+------------------+---------------------------+------------------------+------------------+
|1059033.5578701235| 79545.45857431678| 5.682861321615587|                       4.09|       7.009188142792237|23086.800502686456|
|  1505890.91484695| 79248.64245482568|6.0028998082752425|                       3.09|       6.730821019094919| 40173.07217364482|
|1058987.9878760849|61287.067178656784| 5.865889840310001|                       5.13|       8.512727430375099| 36882.15939970458|
+------------------+------------------+------------------+---------------------------+------------------------+------------------+
only showing top 3 rows



- **An assembler converts the input values to a vector**
- **A vector is what the ML algorithm reads to train a model**

### Set the input columns to read the values from
### Set the name of the output column where the vector will be stored

In [13]:
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Bedrooms"
                                                        ,"Avg Area Number of Rooms","Area Population")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_0aa8a599ed80


### Use the assembler to transform our DataFrame to the two columns

In [14]:
val output = assembler.transform(df).select($"label",$"features")

output: org.apache.spark.sql.DataFrame = [label: double, features: vector]


In [15]:
output.show(5,false)

+------------------+-------------------------------------------------------------------------------+
|label             |features                                                                       |
+------------------+-------------------------------------------------------------------------------+
|1059033.5578701235|[79545.45857431678,5.682861321615587,4.09,7.009188142792237,23086.800502686456]|
|1505890.91484695  |[79248.64245482568,6.0028998082752425,3.09,6.730821019094919,40173.07217364482]|
|1058987.9878760849|[61287.067178656784,5.865889840310001,5.13,8.512727430375099,36882.15939970458]|
|1260616.8066294468|[63345.24004622798,7.1882360945186425,3.26,5.586728664827653,34310.24283090706]|
|630943.4893385402 |[59982.19722570803,5.040554523106283,4.23,7.839387785120487,26354.109472103148]|
+------------------+-------------------------------------------------------------------------------+
only showing top 5 rows
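
Conceptually, the assembler concatenates the chosen columns of each row, in the order given to `setInputCols`, into one vector. The sketch below shows that idea in plain Scala without Spark. `AssemblerSketch`, `assemble` and the `Map`-based row are hypothetical stand-ins for illustration, not part of the Spark API.

```scala
object AssemblerSketch {
  // Hypothetical stand-in for a DataFrame row: column name -> numeric value.
  type Row = Map[String, Double]

  // Concatenate the selected columns, in the given order, into one feature vector.
  def assemble(row: Row, inputCols: Seq[String]): Vector[Double] =
    inputCols.map(row).toVector

  // First row of the housing data, copied from the output shown earlier.
  val firstRow: Row = Map(
    "Avg Area Income" -> 79545.45857431678,
    "Avg Area House Age" -> 5.682861321615587,
    "Avg Area Number of Bedrooms" -> 4.09,
    "Avg Area Number of Rooms" -> 7.009188142792237,
    "Area Population" -> 23086.800502686456
  )

  // Same column order as the assembler above.
  val inputCols: Seq[String] = Seq(
    "Avg Area Income", "Avg Area House Age", "Avg Area Number of Bedrooms",
    "Avg Area Number of Rooms", "Area Population"
  )

  val features: Vector[Double] = assemble(firstRow, inputCols)
}
```

`AssemblerSketch.features` holds the same five values, in the same order, as the first `features` entry in the table above.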



### Splitting the resulting data into training data and testing data

- **Training data is used to train the model**
- **Testing data is used to test the trained model**

#### Splitting the data 70% / 30% into training and testing data, respectively

In [16]:
val Array(train_data, test_data) = output.randomSplit(Array(0.7,0.3))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]


In [22]:
train_data.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|              3518|
|   mean|1230560.7797947223|
| stddev|352169.89182459255|
|    min|15938.657923287848|
|    max|2370231.3201015536|
+-------+------------------+



In [23]:
test_data.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|              1482|
|   mean|1235661.5704412628|
| stddev|355450.80578399764|
|    min|211017.97049475575|
|    max|2469065.5941747027|
+-------+------------------+
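
The counts above (3518 and 1482 rows) are close to, but not exactly, 70% and 30% of the 5000 rows. That is what we would expect if each row is assigned to a split independently at random, which is how we understand `randomSplit` to work. The plain-Scala sketch below illustrates the idea under that assumption. `SplitSketch` and its `randomSplit` helper are hypothetical and are not Spark's implementation.

```scala
import scala.util.Random

object SplitSketch {
  // Assign each item independently to train (with probability trainFraction) or to test.
  def randomSplit[A](items: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw one random number per item; a fixed seed makes the split reproducible.
    val tagged = items.map(item => (item, rng.nextDouble() < trainFraction))
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }

  // Split 5000 row ids roughly 70% / 30%.
  val (train, test) = randomSplit(1 to 5000, trainFraction = 0.7, seed = 42L)
}
```

Spark's own `randomSplit` also accepts a seed as a second argument, for example `output.randomSplit(Array(0.7, 0.3), seed = 42L)`, which keeps the train and test sets the same between runs.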



### Creating a linear regression model object

In [24]:
val lr = new LinearRegression().setLabelCol("label").setFeaturesCol("features")

lr: org.apache.spark.ml.regression.LinearRegression = linReg_fd5615b54f72


### Fitting the linear regression model to the training data

In [25]:
val lrModel = lr.fit(train_data)

2019-12-26 11:16:54 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-26 11:16:54 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
2019-12-26 11:16:54 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
2019-12-26 11:16:54 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK


lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_fd5615b54f72


### Getting the training summary of the created model

In [27]:
val trainingSummary = lrModel.summary

trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@15dc3f01


### Residuals

In [29]:
trainingSummary.residuals.show(5)

+------------------+
|         residuals|
+------------------+
|  -53139.342100527|
|-78122.98662759717|
|-68321.06200949212|
| 51200.89641396806|
|-220668.0145468796|
+------------------+
only showing top 5 rows



### Errors

In [33]:
println(s"Mean Absolute Error: ${trainingSummary.meanAbsoluteError}")
println(s"Mean Squared Error: ${trainingSummary.meanSquaredError}")
println(s"Root Mean Squared Error: ${trainingSummary.rootMeanSquaredError}")
println(s"R Squared: ${trainingSummary.r2}")

Mean Absolute Error: 81244.25778543952
Mean Squared Error: 1.0179945237000706E10
Root Mean Squared Error: 100895.7146612318
R Squared: 0.9178959726345578
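
To make the four quantities above concrete, here is a minimal plain-Scala sketch of how they can be computed from labels and predictions. It uses the standard textbook definitions, with the residual taken as label minus prediction. `MetricsSketch` and the tiny example data are hypothetical and are not taken from the housing data set.

```scala
object MetricsSketch {
  // Residual = label - prediction.
  def residuals(labels: Seq[Double], preds: Seq[Double]): Seq[Double] =
    labels.zip(preds).map { case (y, p) => y - p }

  // Mean of the absolute residuals.
  def mae(labels: Seq[Double], preds: Seq[Double]): Double =
    residuals(labels, preds).map(r => math.abs(r)).sum / labels.length

  // Mean of the squared residuals.
  def mse(labels: Seq[Double], preds: Seq[Double]): Double =
    residuals(labels, preds).map(r => r * r).sum / labels.length

  // Square root of the MSE, in the same units as the label.
  def rmse(labels: Seq[Double], preds: Seq[Double]): Double =
    math.sqrt(mse(labels, preds))

  // R squared = 1 - (residual sum of squares / total sum of squares).
  def r2(labels: Seq[Double], preds: Seq[Double]): Double = {
    val mean = labels.sum / labels.length
    val ssRes = residuals(labels, preds).map(r => r * r).sum
    val ssTot = labels.map(y => (y - mean) * (y - mean)).sum
    1.0 - ssRes / ssTot
  }

  // Tiny made-up example: MAE = 0.25, MSE = 0.125, R squared = 0.9.
  val labels: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0)
  val preds: Seq[Double] = Seq(1.5, 2.0, 2.5, 4.0)
}
```

The same relationship holds in the output above: the root mean squared error (about 100895.7) is the square root of the mean squared error (about 1.018E10). R squared is a goodness-of-fit score rather than an error, so values closer to 1 are better.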


### Evaluating the model against test data

In [35]:
val test_results = lrModel.evaluate(test_data)

test_results: org.apache.spark.ml.regression.LinearRegressionSummary = org.apache.spark.ml.regression.LinearRegressionSummary@153ce85a


### Getting the coefficients and intercept

In [36]:
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")

Coefficients: [21.45412372772751,163929.87098514492,762.0527910566846,122231.78361454248,15.272557127819224]
Intercept: -2629250.4438277474
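
A linear regression prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The coefficients follow the assembler's column order: income, house age, bedrooms, rooms, population. The plain-Scala sketch below applies the values printed above to the first test row from the prediction table later in this notebook. `PredictSketch` and `predict` are hypothetical names used only for illustration.

```scala
object PredictSketch {
  // Coefficients and intercept copied from the model output above.
  val coefficients: Vector[Double] = Vector(
    21.45412372772751, 163929.87098514492, 762.0527910566846,
    122231.78361454248, 15.272557127819224
  )
  val intercept: Double = -2629250.4438277474

  // prediction = intercept + sum over i of coefficient(i) * feature(i)
  def predict(features: Seq[Double]): Double =
    intercept + coefficients.zip(features).map { case (w, x) => w * x }.sum

  // Features of the first row in the prediction table later in the notebook.
  val firstTestRow: Vector[Double] = Vector(
    50926.776633862784, 4.507953423649012, 4.01,
    6.154788047780713, 33663.6692440211
  )
}
```

`PredictSketch.predict(PredictSketch.firstTestRow)` comes out at about 471824.01, which agrees with the prediction Spark reports for that row.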


### Residuals on the test data

In [38]:
test_results.residuals.show(5)

+-------------------+
|          residuals|
+-------------------+
|-260806.03998372125|
| -63123.19783675301|
|  1119.126134990016|
| -66822.19692723238|
| 31521.908127737057|
+-------------------+
only showing top 5 rows



### Evaluating the model by checking the different types of error

In [39]:
println(s"Mean Absolute Error: ${test_results.meanAbsoluteError}")
println(s"Mean Squared Error: ${test_results.meanSquaredError}")
println(s"Root Mean Squared Error: ${test_results.rootMeanSquaredError}")
println(s"R Squared: ${test_results.r2}")

Mean Absolute Error: 81740.34615371792
Mean Squared Error: 1.0340750449878592E10
Root Mean Squared Error: 101689.4805271351
R Squared: 0.9180995672498196


### Getting predictions from the trained model for data without the label column

In [40]:
val unlabelled_data = test_data.select("features")

unlabelled_data: org.apache.spark.sql.DataFrame = [features: vector]


In [41]:
val predictions = lrModel.transform(unlabelled_data)

predictions: org.apache.spark.sql.DataFrame = [features: vector, prediction: double]


In [43]:
predictions.show(10,false)

+---------------------------------------------------------------------------------+------------------+
|features                                                                         |prediction        |
+---------------------------------------------------------------------------------+------------------+
|[50926.776633862784,4.507953423649012,4.01,6.154788047780713,33663.6692440211]   |471824.010478477  |
|[62173.580099008206,5.098958554487097,3.14,5.662268147045037,3883.448164008629]  |294313.0188266118 |
|[46367.20585888387,5.290720475879383,4.5,5.181613803996049,26015.296446574837]   |266931.6886085239 |
|[49851.134783967645,4.6849956831963695,3.04,5.259695411851781,32511.846268200865]|350030.3291141116 |
|[51144.8509024324,4.1916217441472075,3.03,6.7018691307537654,18277.601364263126] |255785.67556118593|
|[41240.05727656731,5.81593445999404,4.15,5.2116956265858665,40888.07856106315]   |473587.12005436234|
|[54994.91828997551,5.186801231971905,2.2,6.063933310548773,15444.4826129

### Stopping the Spark session

In [44]:
spark.stop()

## Thank You!