# Open Team Exercise: Predicting House Prices

![](graphics/house-for-sale-sign.jpg)

In this exercise, we are going to build another machine learning model: one that predicts the sale price of a house from its features.

## Classification vs Regression

We speak of **classification** if the model outputs a _categorical_ variable, i.e. assigns labels to data points that divide them into groups. The machine learning algorithm often performs this task by creating and optimizing a **decision boundary** in the feature space that separates classes. (The previous chapter introduced an example of a predictive classification model.)

We speak of **regression** if the target variable is a _continuous_ value. This is the task of [📓fitting](../stats/stats-fitting-short.ipynb) a function to the data points so that it enables prediction.

![](https://upload.wikimedia.org/wikipedia/commons/1/13/Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)
**classification**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)_

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/500px-Linear_regression.svg.png)
**regression**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Linear_regression.svg)_
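
To make the distinction concrete, here is a minimal pure-Python sketch, not part of the exercise data. The areas and prices are made-up numbers, and `fit_line`, `predict_price`, and `classify_size` are hypothetical helper names. The regression part fits a straight line by ordinary least squares and returns a continuous value; the classification part applies a fixed decision boundary and returns a label.

```python
# Hypothetical toy data: living area and price (illustrative values only).
areas = [50.0, 80.0, 100.0, 120.0, 150.0]
prices = [150.0, 240.0, 300.0, 360.0, 450.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)


def predict_price(area):
    """Regression: the output is a continuous value."""
    return slope * area + intercept


def classify_size(area, boundary=100.0):
    """Classification: the output is one of a fixed set of labels."""
    return "large" if area >= boundary else "small"


print(predict_price(110.0))  # a number
print(classify_size(110.0))  # a label
```

In a real classifier the decision boundary is learned from labeled data rather than fixed by hand; the fixed `boundary` here only illustrates the kind of output.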

## Loading the Data

In [10]:
import findspark
findspark.init()
import pyspark

In [11]:
data_dir = "../.assets/data/house/"

In [12]:
!ls {data_dir}

data_description.txt  prices.csv


In [13]:
!head {data_dir}/data_description.txt

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES


In [14]:
!head {data_dir}/prices.csv

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PCo



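After creating a `SparkSession`, we read the contents of the .csv file into a DataFrame. Note that the cell below does not define a schema, so Spark reads every column as a string; numeric columns must be cast (or a schema supplied) before they can be used as features.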

In [15]:
spark = pyspark.sql.SparkSession \
    .builder \
    .appName("HousePricePredictor") \
    .getOrCreate()


In [20]:
data = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(f"{data_dir}/prices.csv") 


In [24]:
data[["OverallQual", "OverallCond", "YearBuilt", "SalePrice"]].show()

+-----------+-----------+---------+---------+
|OverallQual|OverallCond|YearBuilt|SalePrice|
+-----------+-----------+---------+---------+
|          7|          5|     2003|   208500|
|          6|          8|     1976|   181500|
|          7|          5|     2001|   223500|
|          7|          5|     1915|   140000|
|          8|          5|     2000|   250000|
|          5|          5|     1993|   143000|
|          8|          5|     2004|   307000|
|          7|          6|     1973|   200000|
|          7|          5|     1931|   129900|
|          5|          6|     1939|   118000|
|          5|          5|     1965|   129500|
|          9|          5|     2005|   345000|
|          5|          6|     1962|   144000|
|          7|          5|     2006|   279500|
|          6|          5|     1960|   157000|
|          7|          8|     1929|   132000|
|          6|          7|     1970|   149000|
|          4|          5|     1967|    90000|
|          5|          5|     2004

## Useful Hints

- Start by building a **minimum viable model** that uses a few strong features, then add more features to improve performance.
- `pyspark.ml` provides [**a few algorithms for regression**](https://spark.apache.org/docs/latest/ml-classification-regression.html#regression); use both reasoning and experimentation to select a viable one.
- Build your pipeline from the building blocks provided by `pyspark.ml` (`Estimator`, `Transformer`, `Pipeline`, ...).


## Evaluation

`mllib.RegressionMetrics` implements a number of standard metrics (such as RMSE, MAE, and R²) to evaluate the performance of a regression model.

In [None]:
import pandas
from pyspark.mllib.evaluation import RegressionMetrics
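
Since the target in this exercise is a continuous price, regression metrics are the natural yardstick. The pure-Python sketch below shows what three common ones compute. The `actual` values are taken from the table above, the `predicted` values are made up for illustration, and `regression_metrics` is a hypothetical helper rather than a Spark function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


actual = [208500.0, 181500.0, 223500.0, 140000.0]     # from the table above
predicted = [200000.0, 190000.0, 220000.0, 150000.0]  # made-up predictions

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Within `pyspark.ml`, `RegressionEvaluator` reports the same quantities when `metricName` is set to `"mae"`, `"rmse"`, or `"r2"`, and it works directly on the DataFrame produced by a fitted pipeline.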

---
_This notebook is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Copyright © 2018 [Point 8 GmbH](https://point-8.de)_