# **Homework 5**

Using the provided New York City (NYC) dataset, apply linear regression to predict housing prices. For each house observation, we have the following information:

**crim** — per capita crime rate by town.

**zn** — proportion of residential land zoned for lots over 25,000 sq.ft.

**indus** — proportion of non-retail business acres per town.

**chas** — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

**nox** — nitrogen oxides concentration (parts per 10 million).

**rm** — average number of rooms per dwelling.

**age** — proportion of owner-occupied units built prior to 1940.

**dis** — weighted mean of distances to five employment centres.

**rad** — index of accessibility to radial highways.

**tax** — full-value property-tax rate per 10,000 USD.

**ptratio** — pupil-teacher ratio by town.

**black** — 1000(bk - 0.63)², where bk is the proportion of blacks by town.

**lstat** — lower status of the population (percent).

**medv** — median value of owner-occupied homes in $1000s. This is the target variable.

Each row of the input dataset describes the housing in one area using the variables above. The goal is to build a model that predicts the median home value (medv) for an area from the other variables.

**Hint: You should use medv as the dependent (target) variable, and remember to calculate the MSE and RMSE.**
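
As a reminder of what the two metrics in the hint measure, here is a minimal plain-Python sketch. The helper names `mse` and `rmse` and the sample values are made up for illustration; they are not part of the assignment or of PySpark.

```python
import math

def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical medv values (in $1000s) and predictions, for illustration only
actual = [24.0, 21.5, 34.5]
predicted = [26.0, 20.5, 32.5]

print(mse(actual, predicted))   # (4 + 1 + 4) / 3
print(rmse(actual, predicted))  # square root of the value above
```

Because RMSE is expressed in the units of medv, it can be read directly as a typical prediction error in thousands of dollars.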


In [1]:
#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark

#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

#Step 3: Initialize Pyspark
import findspark
findspark.init()

In [2]:
#creating spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
sc

In [33]:
df = spark.read.csv('nyc.csv',inferSchema=True,header=True)
df.show()

+---+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|_c0|   crim|  zn|indus|chas|  nox|   rm|  age|   dis|rad|tax|ptratio| black|lstat|medv|
+---+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|  1|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|  2|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|  3|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|  4|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|  5|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
|  6|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7|394.12| 5.21|28.7|
|  7|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|311|   15.2| 395.6|12.43|22.9|
|  8|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311|   15.2| 396.9|19.15|27.1|
|  9|0.21124|12.5| 7.

In [48]:
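from pyspark.ml.feature import VectorAssembler

# Combine the predictor columns into a single "features" vector column.
# Note: "_c0" is the row index read from the CSV, not a property of the houses;
# it is kept in inputCols here because the outputs in this notebook were produced with it.
assembler = VectorAssembler(
    inputCols=["_c0", "crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat"],
    outputCol="features")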

output = assembler.transform(df)
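
`VectorAssembler` concatenates the listed columns, in the order given, into one vector per row. The plain-Python sketch below mimics that for the first row printed by `df.show()`. The helper `assemble` is hypothetical and only illustrates the idea; it is not how PySpark implements it.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of one row, in order, into a feature list."""
    return [float(row[col]) for col in input_cols]

# First row of the table printed by df.show()
row = {"_c0": 1, "crim": 0.00632, "zn": 18.0, "indus": 2.31, "chas": 0,
       "nox": 0.538, "rm": 6.575, "age": 65.2, "dis": 4.09, "rad": 1,
       "tax": 296, "ptratio": 15.3, "black": 396.9, "lstat": 4.98, "medv": 24.0}

input_cols = ["_c0", "crim", "zn", "indus", "chas", "nox", "rm", "age",
              "dis", "rad", "tax", "ptratio", "black", "lstat"]

features = assemble(row, input_cols)
print(features[:3])  # [1.0, 0.00632, 18.0]

# Variant that leaves out the row index, which carries no information about a house
predictors = [c for c in input_cols if c != "_c0"]
print(len(assemble(row, predictors)))  # 13
```

The target `medv` is deliberately left out of the feature list in both variants, since it is the value being predicted.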

In [60]:
# Splitting 70:30
training,testing = output.select("features", "medv").randomSplit([0.7,0.3])
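
`randomSplit` assigns rows to the two sets by random sampling, so the resulting sizes only approximate 70:30 (348 and 158 of 506 rows in the outputs shown in this notebook) and change between runs unless the optional `seed` argument is passed. The sketch below illustrates the idea in plain Python. `random_split` is a hypothetical helper, not Spark's actual implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1, 507))  # 506 row ids, matching 348 + 158 in the outputs
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # near 354 / 152, but usually not exactly

# Re-running with the same seed reproduces the split
assert random_split(rows, 0.7, seed=42) == (train, test)
```

Fixing the seed makes the reported metrics reproducible, which matters here because every re-run of the split cell otherwise changes the training and test sets.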

In [65]:
print("Training: assembled columns '_c0', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat' to column 'medv'")
#training = output.select("features", "medv")
training.show(5)
training.describe().show()

Training: assembled columns '_c0', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat' to column 'medv'
+--------------------+----+
|            features|medv|
+--------------------+----+
|[1.0,0.00632,18.0...|24.0|
|[4.0,0.03237,0.0,...|33.4|
|[5.0,0.06905,0.0,...|36.2|
|[6.0,0.02985,0.0,...|28.7|
|[7.0,0.08829,12.5...|22.9|
+--------------------+----+
only showing top 5 rows

+-------+------------------+
|summary|              medv|
+-------+------------------+
|  count|               348|
|   mean|23.209195402298857|
| stddev| 9.484501322947517|
|    min|               5.0|
|    max|              50.0|
+-------+------------------+



In [67]:
print("Testing: assembled columns '_c0', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat' to column 'medv'")
#training = output.select("features", "medv")
testing.show(5)
testing.describe().show()

Testing: assembled columns '_c0', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat' to column 'medv'
+--------------------+----+
|            features|medv|
+--------------------+----+
|[2.0,0.02731,0.0,...|21.6|
|[3.0,0.02729,0.0,...|34.7|
|[8.0,0.14455,12.5...|27.1|
|[13.0,0.09378,12....|21.7|
|[14.0,0.62976,0.0...|20.4|
+--------------------+----+
only showing top 5 rows

+-------+------------------+
|summary|              medv|
+-------+------------------+
|  count|               158|
|   mean|21.043037974683546|
| stddev| 8.367272969968337|
|    min|               5.0|
|    max|              50.0|
+-------+------------------+



In [45]:
from pyspark.ml.regression import LinearRegression

# featuresCol and predictionCol are set to their default values; labelCol is changed from its default ('label') to 'medv'
lr = LinearRegression(featuresCol='features', labelCol='medv', predictionCol='prediction')

In [69]:
# Fit the model
lrTrainModel = lr.fit(training)

In [53]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {}".format(str(lrTrainModel.coefficients))) # For each feature...
print('\n')
print("Intercept:{}".format(str(lrTrainModel.intercept)))

Coefficients: [-0.003380495291687249,-0.15433829796305748,0.03787705411693594,0.01218177767061619,3.0944802157043014,-14.974197553555221,4.238395727490869,-0.0074360837076530555,-1.3592176193231333,0.3201799140143225,-0.008831143716188709,-0.958099411612571,0.008676192750111111,-0.5208040354058792]


Intercept:32.19803900246931
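
To see what these numbers mean, a prediction is just the intercept plus the dot product of the coefficients with a row's feature vector. The sketch below applies the printed coefficients to the first row of the dataset (medv = 24.0). The result should come out near 30.57, which gives a residual (label minus prediction) near -6.57, in line with the first residual reported in the training summary.

```python
# Coefficients and intercept copied from the printed output
coefficients = [
    -0.003380495291687249, -0.15433829796305748, 0.03787705411693594,
    0.01218177767061619, 3.0944802157043014, -14.974197553555221,
    4.238395727490869, -0.0074360837076530555, -1.3592176193231333,
    0.3201799140143225, -0.008831143716188709, -0.958099411612571,
    0.008676192750111111, -0.5208040354058792,
]
intercept = 32.19803900246931

# First row printed by df.show(), in inputCols order (_c0, crim, ..., lstat)
features = [1, 0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09,
            1, 296, 15.3, 396.9, 4.98]
medv = 24.0

# Linear model: prediction = intercept + sum of coefficient * feature value
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
residual = medv - prediction  # label minus prediction

print(prediction, residual)
```

Read this way, each coefficient is the change in predicted medv (in $1000s) for a one-unit increase in that feature with the others held fixed; for example, the `rm` coefficient of about 4.24 corresponds to roughly $4,240 per additional room.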


In [56]:
# Summarize the model over the training set and print out some metrics
trainingSummary = lrTrainModel.summary

trainingSummary.residuals.show()
print("Train RMSE: {}".format(trainingSummary.rootMeanSquaredError))
print("Train MSE: {}".format(trainingSummary.meanSquaredError))
print("Train r2: {}".format(trainingSummary.r2))

+--------------------+
|           residuals|
+--------------------+
|  -6.568128055875576|
| -3.6594372141441056|
|  3.4474470521512863|
|  3.9959445775218967|
|   7.460955028408318|
|   2.992300907311847|
|   7.110227527859998|
|   4.728118699197189|
| -0.5500388138418408|
|  -4.595918531167037|
|  -3.174639595171623|
| 0.07905648929442322|
|  0.3560829035769473|
| -1.4392777417376301|
|  0.1470161908838108|
|  1.8776592819652862|
| 0.28295771415391613|
|   3.586176522791515|
|-0.42442549589911494|
| -0.2509293313240182|
+--------------------+
only showing top 20 rows

Train RMSE: 4.568607494111862
Train MSE: 20.872174435255065
Train r2: 0.7463540707642433
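
The three metrics are related: RMSE is the square root of MSE, and r2 is, by the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks the first relation against the printed training values and computes r2 on a small made-up sample. The helper `r_squared` and the sample data are hypothetical.

```python
# Values printed by the training summary
train_rmse = 4.568607494111862
train_mse = 20.872174435255065

# RMSE squared should reproduce MSE up to floating-point rounding
print(abs(train_rmse ** 2 - train_mse))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical labels and predictions, for illustration only
actual = [20.0, 24.0, 28.0, 32.0]
predicted = [21.0, 23.0, 29.0, 31.0]
print(r_squared(actual, predicted))  # ss_res = 4, ss_tot = 80
```

An r2 of about 0.75 on the training set therefore means the model accounts for roughly three quarters of the variance in medv on that data.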


In [55]:
test_results = lrTrainModel.evaluate(testing)

In [73]:
unlabeled_data = testing.select('features')
predictions = lrTrainModel.transform(unlabeled_data)
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[2.0,0.02731,0.0,...|25.298532322969642|
|[3.0,0.02729,0.0,...| 31.91133084620289|
|[8.0,0.14455,12.5...|  19.1188993193866|
|[13.0,0.09378,12....|21.246849749828947|
|[14.0,0.62976,0.0...| 20.16687870223637|
|[15.0,0.63796,0.0...|19.699439023758103|
|[27.0,0.67191,0.0...| 15.49869186928133|
|[28.0,0.95577,0.0...|15.239884249994871|
|[32.0,1.35472,0.0...|18.168623223627215|
|[35.0,1.61282,0.0...|14.340514398144315|
|[36.0,0.06417,0.0...| 23.98757028116823|
|[37.0,0.09744,0.0...| 22.61826223851545|
|[47.0,0.18836,0.0...|20.795948327784387|
|[48.0,0.22927,0.0...| 17.72390814190196|
|[49.0,0.25387,0.0...| 8.109208006807421|
|[50.0,0.21977,0.0...| 16.81493533273853|
|[52.0,0.04337,21....| 23.69170455365076|
|[54.0,0.04981,21....|24.409303599692123|
|[57.0,0.02055,85....| 24.61510249657352|
|[60.0,0.10328,25....| 21.24484698293001|
+--------------------+------------------+

In [75]:
test_results.residuals.show(5)

print("Test RMSE: {}".format(test_results.rootMeanSquaredError))
print("Test MSE: {}".format(test_results.meanSquaredError))
print("Test r2: {}".format(test_results.r2))

+-------------------+
|          residuals|
+-------------------+
|-0.6616761319407019|
|  1.137533385260019|
| 1.7472328953068086|
|-0.8615770581114504|
| 0.6456943483903608|
+-------------------+
only showing top 5 rows

Test RMSE: 5.007847507799957
Test MSE: 25.07853666137824
Test r2: 0.720618033883305
