# Load Data

Load sales data from S3 / HDFS. We use the built-in "csv" method, which can treat the first line as column names and which also supports inferring the schema automatically. We use both options, which saves us the code for specifying the schema explicitly.

We also peek inside the data by retrieving the first five records.

In [None]:
import pandas as pd

pd.set_option('display.max_columns', None)

In [None]:
import pyspark.sql.functions as f

data = spark.read\
    .option("header","true")\
    .option("inferSchema","true")\
    .csv("s3a://dimajix-training/data/kc-house-data")

data.limit(5).toPandas()

## Inspect Schema

Now that we have loaded the data and the schema has been inferred automatically, let's inspect it.

In [None]:
## YOUR CODE HERE

# Initial Investigations

As a first step to get an idea of our data, we create some simple visualizations. We use the Python matplotlib package for creating simple two-dimensional plots, where the x axis will be one of the provided attributes and the y axis will be the house price.

In [None]:
%matplotlib inline

In [None]:
# Import relevant Python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## House Price in Relation to sqft_living

Probably one of the most important attributes is the size of the house. This is provided in the data in the column "sqft_living". We extract the price column and the sqft_living column and create a simple scatter plot.

In [None]:
# Extract price and attribute
price = data.select("price").toPandas()
sqft_living = data.select("sqft_living").toPandas()

# Create simple scatter plot
plt.plot(sqft_living.iloc[:,0], price.iloc[:,0], ".")

## House Price in Relation to sqft_lot

Another interesting attribute for predicting the house price might be the size of the whole lot, which is provided in the column "sqft_lot". So let's create another plot, now with "price" and "sqft_lot".

In [None]:
## YOUR CODE HERE

# Perform Linear Regression

Let's try to fit a line into the picture by performing a linear regression. This is done in two steps:
1. Extract so-called features from the raw data. The features have to be stored in a new column of type "Vector".
2. Train a linear regression model
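
To make the idea of "fitting a line" more concrete, here is a minimal plain-Python sketch of a least-squares fit for a single feature. It only illustrates the concept and is not how Spark ML solves the problem internally; the function name `fit_line` and the toy numbers are made up for this illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance of x and y, and unnormalized variance of x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2 * x + 1
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]

toy_slope, toy_intercept = fit_line(toy_x, toy_y)
print(toy_slope, toy_intercept)
```

Because the toy points lie exactly on a line, the fit recovers the slope and intercept of that line. With real data such as the house prices, the points scatter around the line and the fit minimizes the squared distances instead.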

In [None]:
from pyspark.ml.feature import *
from pyspark.ml.regression import *

# Extract features using VectorAssembler
vector_assembler = # YOUR CODE HERE
features = # YOUR CODE HERE

# Inspect features
features.limit(10).toPandas()

In [None]:
# Train linear regression model
regression = # YOUR CODE HERE
model = # YOUR CODE HERE

## Inspect Model

Let's inspect the generated linear model. It has two fields, "intercept" and "coefficients", which completely describe the model.

The basic formula of the model is

    y = SUM(coeff[i]*x[i]) + intercept

where y is the predicted variable and x[i] are the input variables.
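
As a small worked example of this formula, the following plain-Python sketch evaluates a linear model for one record. The helper `predict` and all numbers are hypothetical and chosen only for illustration; they are not the values of the trained model.

```python
def predict(coefficients, intercept, x):
    """Evaluate y = SUM(coeff[i] * x[i]) + intercept for a single record."""
    return sum(c * xi for c, xi in zip(coefficients, x)) + intercept


# Hypothetical model with a single feature (for example the living area)
example_coefficients = [250.0]
example_intercept = -40000.0

# 250 * 2000 - 40000
example_prediction = predict(example_coefficients, example_intercept, [2000.0])
print(example_prediction)  # 460000.0
```

With more than one feature, the same function simply sums over more coefficient and input pairs.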

In [None]:
print("Intercept: " + str(model.intercept))
print("Coefficients: " + str(model.coefficients))

## Plot Data and Model

Now let's overlay the original scatter plot with the trained model. The model encodes a line, which can be overlaid by an additional invocation of "plt.plot".

In [None]:
# For plotting the model, we need to generate input and output values. Input values are stored in "model_x"
model_x = np.linspace(0,14000,100)
# model_y contains the model applied to model_x. The model has only one feature and an intercept
model_y = model_x * model.coefficients[0] + model.intercept

plt.plot(sqft_living.iloc[:,0], price.iloc[:,0], ".")
plt.plot(model_x, model_y, "r")

# Measuring Fit

Now the important question, of course, is how well the model approximates the real data. We can find out by transforming our input data using the model. This is done by using the function

    model.transform
    
which accepts the input data as its parameter and adds a new column "prediction" to it, which contains the evaluated model for each record.

In [None]:
prediction = ... # YOUR CODE HERE

# Take the first five records of the result "prediction" and display it as a Pandas dataframe
# YOUR CODE HERE

## Manually Calculate RMSE

Using SQL we compute the root mean squared error (RMSE). Formally it is calculated as

    SQRT(SUM((price - prediction)**2) / n)
    
where n is the number of records.
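
Before writing the SQL, it can help to see the formula on a handful of numbers. The following is a plain-Python sketch of the same calculation; the function name `rmse` and the example prices and predictions are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: SQRT(SUM((actual - predicted)**2) / n)."""
    n = len(actual)
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical prices and model predictions for three houses
example_prices = [300000.0, 450000.0, 500000.0]
example_predictions = [310000.0, 430000.0, 500000.0]

example_rmse = rmse(example_prices, example_predictions)
print(example_rmse)
```

The errors here are -10000, 20000 and 0, so the result lies between the smallest and the largest absolute error, with large errors weighted more strongly because they are squared.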

In [None]:
# YOUR CODE HERE

## Use Built-in Functionality to Measure the Fit

Of course, Spark ML already contains evaluators for the most relevant metrics.

In [None]:
from pyspark.ml.evaluation import *

evaluator = # YOUR CODE HERE

# Measuring Generalization of Model

Now we have an idea of how well the model approximates the given data. But for machine learning it is more important to understand how well a model generalizes from the training data to new data. New data could contain different outliers.

In order to measure the generalization of the model, we need to change our high-level approach. Our new approach needs to provide distinct sets of training data and test data. We can create such data using the Spark method "randomSplit".
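
Conceptually, a random split assigns every record to exactly one of the resulting sets, and a fixed seed makes the assignment reproducible. The plain-Python sketch below illustrates this idea with a hypothetical helper `random_split`; it is not Spark's implementation, only an illustration of why the two sets are disjoint and why their sizes only approximately match the requested weights.

```python
import random


def random_split(records, weights, seed):
    """Assign each record independently to one split, with probability
    proportional to the given weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2]
    boundaries = []
    running = 0.0
    for w in weights:
        running += w / total
        boundaries.append(running)

    splits = [[] for _ in weights]
    last = len(boundaries) - 1
    for record in records:
        draw = rng.random()
        for i, bound in enumerate(boundaries):
            # The last split catches any rounding error in the boundaries
            if draw < bound or i == last:
                splits[i].append(record)
                break
    return splits


# Hypothetical records: just the numbers 0..999
toy_records = list(range(1000))
toy_train, toy_test = random_split(toy_records, [0.8, 0.2], seed=0)
print(len(toy_train), len(toy_test))
```

The two sizes add up to the number of records, but they are only roughly 80% and 20% because every record is assigned by an independent random draw.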

In [None]:
train_data, test_data = features.randomSplit([0.8,0.2], seed=0)

In [None]:
# Train a linear regression model
regression = # YOUR CODE HERE
model = # YOUR CODE HERE

In [None]:
# Now create predictions, but this time for the "test_data" and NOT for the training data itself
prediction = # YOUR CODE HERE

# Evaluate model using RegressionEvaluator again, but this time using the "prediction" data frame
# YOUR CODE HERE

# Improving Prediction

Now that we have a metric and a valid approach, the next question is: How can we improve the model? So far we only used the column "sqft_living" for building the model, but we have much more information about the houses. A very simple way is to include more attributes in the feature vector.

Remember that the schema looked as follows:

    root
     |-- id: long (nullable = true)
     |-- date: string (nullable = true)
     |-- price: decimal(7,0) (nullable = true)
     |-- bedrooms: integer (nullable = true)
     |-- bathrooms: double (nullable = true)
     |-- sqft_living: integer (nullable = true)
     |-- sqft_lot: integer (nullable = true)
     |-- floors: double (nullable = true)
     |-- waterfront: integer (nullable = true)
     |-- view: integer (nullable = true)
     |-- condition: integer (nullable = true)
     |-- grade: integer (nullable = true)
     |-- sqft_above: integer (nullable = true)
     |-- sqft_basement: integer (nullable = true)
     |-- yr_built: integer (nullable = true)
     |-- yr_renovated: integer (nullable = true)
     |-- zipcode: integer (nullable = true)
     |-- lat: double (nullable = true)
     |-- long: double (nullable = true)
     |-- sqft_living15: integer (nullable = true)
     |-- sqft_lot15: integer (nullable = true)
     
We simply use all real numeric columns. Some columns like "condition", "grade", "zipcode" are categorical variables, which we don't want to use now.

In [None]:
# Extract features using VectorAssembler
vector_assembler = VectorAssembler(inputCols=[
            'bedrooms',
            'bathrooms',
            'sqft_lot',
            'floors',
            'sqft_above',
            'sqft_basement',
            'yr_built',
            'yr_renovated',
            'sqft_living15',
            'sqft_lot15'], 
        outputCol='features')
features = vector_assembler.transform(data)

# Again split into training and test data
# YOUR CODE HERE

# Train linear regression model
# YOUR CODE HERE

# Evaluate model
# YOUR CODE HERE