# Load Data

Load the sales data from S3 / HDFS. We use the built-in `csv` reader, which can treat the first line as column names and can also infer the schema automatically. We use both options, which saves us the code for specifying the schema explicitly.

We also peek inside the data by retrieving the first five records.

In [None]:
from pyspark.sql.functions import *

raw_data = spark.read\
    .option("header","true")\
    .option("inferSchema","true")\
    .csv("s3://dimajix-training/data/kc-house-data")

raw_data.limit(5).toPandas()

## Inspect Schema

Now that we have loaded the data and the schema has been inferred automatically, let's inspect it.

In [None]:
# Print the schema of raw_data
# YOUR CODE HERE

## Split training / validation set

First we need to split the data into a training and a validation set. Spark already provides a DataFrame method called `randomSplit`, which takes an array of weights (between 0 and 1) and creates one subset per weight. In our example, the training set should contain 80% of the data and the validation set the remaining 20%.

In [None]:
# Split the data - 80% for training, 20% for validation
# YOUR CODE HERE

print("training_data = " + str(training_data.count()))
print("validation_data = " + str(validation_data.count()))
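To build some intuition for what a weighted random split does, here is a minimal plain-Python sketch. It is not Spark's implementation: the function `random_split`, its `seed` parameter and the toy data are assumptions made for illustration only. Every row draws a random number and lands in exactly one subset, with a probability proportional to that subset's weight.

```python
import random


def random_split(rows, weights, seed=42):
    """Assign each row to exactly one subset, chosen with probability
    proportional to the corresponding weight (illustrative helper only)."""
    total = sum(weights)
    # Cumulative upper bounds of each bucket on the interval [0, 1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    subsets = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches rounding leftovers
            if r < upper or i == len(bounds) - 1:
                subsets[i].append(row)
                break
    return subsets


rows = list(range(1000))
train_part, valid_part = random_split(rows, [0.8, 0.2])
print(len(train_part), len(valid_part))
```

Note that the subset sizes are only approximately 80% and 20%, because each row is assigned at random. The same holds for the counts printed for the Spark split.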

## Adding more Features

The RMSE tells us that, on average, our prediction performs pretty badly. How can we improve it? So far we used only the size of the house for the price prediction, but we have a lot of additional information, so let's make use of it. The mathematical idea is to create a more complex (but still linear) model that also includes other features.

Let's recall that a linear model looks as follows:

    y = SUM(coeff[i]*x[i]) + intercept
    
This means that we are not limited to a single feature `x`, but can use many features `x[0]...x[n]`. Let's do that with the house data!
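The formula can be written down directly in plain Python. The following is only a sketch: the coefficients, the intercept and the feature values are made-up numbers, not results from the house data.

```python
def predict(coefficients, intercept, features):
    """Evaluate y = SUM(coeff[i] * x[i]) + intercept for one record."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Hypothetical model with two features: living area and number of bedrooms
coefficients = [200.0, 10000.0]
intercept = 50000.0

# Hypothetical house: 1500 sqft, 3 bedrooms
price = predict(coefficients, intercept, [1500.0, 3.0])
print(price)
```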

### Inspect data

Since we don't have any additional information, we model some of the existing features differently. So far we used all features as direct linear predictors, which implies, for example, that a grade of 4 is twice as good as a grade of 2. That may not be the case, and not all predictors have a linear influence. Nominal and ordinal features in particular should be modeled differently, as categories. More on that later.

First let's have a look at the data again using Spark's `describe` method.

In [None]:
raw_data.describe().toPandas()

Additionally, let's check how many different zip codes are present in the data. If there are not too many, we could consider creating a one-hot encoded feature from the zip codes. We use the SQL function `countDistinct` to find the number of different zip codes.

In [None]:
# Count the number of distinct ZIP Codes using the SQL function countDistinct
# YOUR CODE HERE

## New Features using One-Hot Encoding

A simple but powerful method for creating new features from categories (i.e. nominal and ordinal features) is one-hot encoding. For each nominal feature, the set of all possible values is indexed from 0 to some n. Since we cannot assume that larger index values have a larger impact, the index is not used directly as a feature. Instead, each possible value is encoded as a 0/1 vector in which only a single entry is one.
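The idea can be illustrated without Spark in a few lines of plain Python. This is a conceptual sketch only: the helpers `build_index` and `one_hot` are hypothetical, the zip codes are made-up sample values, and the index is assigned in order of first appearance, which is a simplification.

```python
def build_index(values):
    """Map every distinct value to an integer index, in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index


def one_hot(idx, size):
    """Return a 0/1 list of the given size with a single 1 at position idx."""
    vec = [0] * size
    vec[idx] = 1
    return vec


# Made-up sample values
zipcodes = ["98001", "98002", "98001", "98003"]
index = build_index(zipcodes)
encoded = [one_hot(index[z], len(index)) for z in zipcodes]

for z, vec in zip(zipcodes, encoded):
    print(z, vec)
```

Each distinct value gets its own vector position, so a linear model can learn a separate coefficient per zip code instead of a single coefficient for a meaningless numeric index.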

Let's try that with the tools Spark provides.

### Indexing Nominal Data
First we need to index the data. Since Spark cannot know which or how many distinct values are present in a specific column, the `StringIndexer` works like an ML algorithm: first it needs to be fit to the data, which returns a `StringIndexerModel` that can then be used for transforming data.

Let's perform both steps and look at the result.

In [None]:
from pyspark.ml.feature import *

indexer = StringIndexer() \
    .setInputCol("zipcode") \
    .setOutputCol("zipcode_idx") \
    .setHandleInvalid("keep")

# Create index model using the `fit` method
index_model = # YOUR CODE HERE

# Apply the index by using the `transform` method of the index model
indexed_zip_data = # YOUR CODE HERE

# Inspect the result
indexed_zip_data.limit(10).toPandas()

An alternative way of configuring the indexer is to specify all relevant parameters in its constructor as follows:

In [None]:
indexer = StringIndexer(
    inputCol = "zipcode",
    outputCol = "zipcode_idx",
    handleInvalid = "keep")
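To make the split between fitting and transforming more concrete, here is a plain-Python imitation of the pattern. The classes `SimpleIndexer` and `SimpleIndexerModel` are hypothetical and heavily simplified. They mimic two behaviours of the Spark indexer: by default the most frequent value receives index 0, and with `handleInvalid = "keep"` values that were not seen during fitting are put into one additional bucket at the end.

```python
from collections import Counter


class SimpleIndexerModel:
    """Result of fitting: a fixed list of labels that can index new data."""

    def __init__(self, labels):
        self.labels = labels
        self._index = {label: i for i, label in enumerate(labels)}

    def transform(self, values):
        # Unseen values go into one extra bucket, similar to handleInvalid="keep"
        unseen = len(self.labels)
        return [self._index.get(v, unseen) for v in values]


class SimpleIndexer:
    """Learns the set of labels from the data, most frequent value first."""

    def fit(self, values):
        counts = Counter(values)
        # Sort by descending frequency, ties broken alphabetically
        labels = sorted(counts, key=lambda v: (-counts[v], v))
        return SimpleIndexerModel(labels)


toy_index_model = SimpleIndexer().fit(["b", "a", "b", "c", "b", "a"])
indexed = toy_index_model.transform(["a", "b", "c", "d"])
print(toy_index_model.labels, indexed)
```

The value `"d"` was not part of the fitting data and therefore ends up in the extra bucket with index 3.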

### One-Hot-Encoder

Now we have a single number (the index of the value) in a new column `zipcode_idx`. But in order to use this information in a linear model, we need to create sparse vectors from this index that contain exactly one `1`. This can be done with the `OneHotEncoder` transformer. This time no fitting is required; the class can be used directly with its `transform` method. (This holds for Spark 2.x. Since Spark 3.0, `OneHotEncoder` is an estimator and has to be fit first, just like the `StringIndexer`.)

In [None]:
encoder = OneHotEncoder() \
    .setInputCol("zipcode_idx") \
    .setOutputCol("zipcode_onehot")

encoded_zip_data = encoder.transform(indexed_zip_data)
encoded_zip_data.limit(10).toPandas()

# Creating Pipelines

Since it would be tedious to add all features one after another and apply a full chain of transformations to the training set, the validation set and eventually to new data, Spark provides a `Pipeline` abstraction. A Pipeline simply contains a sequence of transformers and (possibly multiple) machine learning algorithms. The whole pipeline can then be trained using the `fit` method, which returns a `PipelineModel` instance. This instance contains all transformers and trained models and can be used directly for prediction.
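The following plain-Python sketch shows the mechanics behind this abstraction under simplifying assumptions. All classes (`Shift`, `MeanCenterer`, `SimplePipeline`, `SimplePipelineModel`) are hypothetical toy stand-ins and not part of Spark. The key point is that each stage that needs fitting is fit on the output of the stages before it, and the fitted pipeline then replays all stages in order.

```python
class Shift:
    """Toy transformer: adds a constant to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]


class MeanCenterer:
    """Toy estimator: learns the mean and returns a transformer subtracting it."""

    def fit(self, data):
        mean = sum(data) / len(data)
        return Shift(-mean)


class SimplePipelineModel:
    """Fitted pipeline: applies all fitted stages one after another."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data produced by the previous stages
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            fitted.append(model)
        return SimplePipelineModel(fitted)


toy_pipeline = SimplePipeline([Shift(1.0), MeanCenterer()])
toy_model = toy_pipeline.fit([1.0, 2.0, 3.0])
result = toy_model.transform([10.0])
print(result)
```

After fitting, the learned mean (3.0 after the first shift) is frozen inside the fitted pipeline, so new data is transformed with exactly the same parameters as the training data.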

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.ml.regression import *

pipeline = Pipeline(stages = [
    # For every nominal feature, you have to create a pair of StringIndexer and OneHotEncoder. 
    # The StringIndexer should store its index result in some new column, which then is used 
    # by the OneHotEncoder to create a one-hot vector.
    StringIndexer(
        inputCol = "bathrooms",
        outputCol = "bathrooms_idx",
        handleInvalid = "keep"),
    OneHotEncoder(
        inputCol = "bathrooms_idx",
        outputCol = "bathrooms_onehot"),
    # Add StringIndexers and OneHotEncoders for the following nominal columns:
    # "bedrooms", "floors", "grade", "zipcode"
    # YOUR CODE HERE
    
    # In addition add OneHotEncoder for the columns "view" and "condition"
    # YOUR CODE HERE
    
    # Now add a VectorAssembler which collects all One-Hot encoded columns and the following numeric columns:
    # "sqft_living", "sqft_lot", "waterfront", "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "sqft_living15", "sqft_lot15"
    # YOUR CODE HERE
    
    # Finally add a LinearRegression which uses the output of the VectorAssembler as features and the
    # target variable "price" as label column
    # YOUR CODE HERE

    ]
)


### Train model with training data

Once you have created the `Pipeline`, you can fit it in a single step using the `fit` method. This will return an instance of the class `PipelineModel`. Assign this model instance to a variable called `model`.

And remember: Use the training data for fitting!

In [None]:
model = # YOUR CODE HERE

## Evaluate model using validation data

Now that we have a model, we need to measure its performance. This requires creating predictions by applying the model to the validation data with the `transform` method of the model. The quality metric of the prediction is implemented in the `RegressionEvaluator` class from the Spark ML evaluation package. Create an instance of the evaluator and configure it to use the column `price` as the target (label) variable and the column `prediction` (which has been created by the pipeline model) as the prediction column. Also remember to set the metric name to `rmse`. Finally, feed the predictions into the evaluator, which will calculate the desired quality metric (RMSE in our case).

In [None]:
from pyspark.ml.evaluation import *

# Create and configure a RegressionEvaluator
evaluator = # YOUR CODE HERE

# Create predictions of the validation data by using the "transform" method of the model
pred = # YOUR CODE HERE

# Now measure the quality of the prediction by using the "evaluate" method of the evaluator
rmse = # YOUR CODE HERE

print("RMSE = " + str(rmse))
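For reference, the RMSE (root mean squared error) is easy to compute by hand. The sketch below uses plain Python and made-up prices and predictions. The helper name `root_mean_squared_error` is an assumption chosen for this illustration.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Square root of the average squared difference between label and prediction."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: every prediction is off by exactly 10
labels = [200.0, 300.0, 400.0, 500.0]
predictions = [210.0, 290.0, 410.0, 490.0]

toy_rmse = root_mean_squared_error(labels, predictions)
print("RMSE = " + str(toy_rmse))
```

Because the errors are squared before averaging, a few large errors increase the RMSE much more than many small ones.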

# Adding more models

Another way of improving the overall prediction is to add multiple models to a single Pipeline. Each downstream ML algorithm has access to the predictions of the previous stages. This way we can create two independent models and fit a mixed model as the last step. In this example we want to use a simple linear model created by a `LinearRegression` and combine it with a Poisson model created by a `GeneralizedLinearRegression`. The results of both models are then combined by a final `LinearRegression` model.

In [None]:
pipeline = Pipeline(stages = [
    # Extract all features as done before including the VectorAssembler as the last step
    # YOUR CODE HERE

    # Now add a LinearRegression, but the prediction should be stored in a column "linear_prediction" instead of the default column.
    # This will be our first (linear) prediction
    # YOUR CODE HERE

    # Now add a GeneralizedLinearRegression, which should also use the features as its input and the price as the target
    # variable. Lookup settings for the GeneralizedLinearRegression in the Spark documentation and select the "poisson"
    # family and the "log" link function. The prediction column should be "poisson_prediction"
    # YOUR CODE HERE

    # Now create a new feature from both prediction columns from both regressions above. This is done by using
    # a new VectorAssembler. Set the name of the feature column to "pred_features"
    # YOUR CODE HERE
        
    LinearRegression(
        featuresCol = "pred_features",
        labelCol = "price",
        predictionCol = "prediction")
    ]
)

### Train model with training data

As before, we train a model using the `fit` method of the pipeline.

In [None]:
model = # YOUR CODE HERE

### Evaluate model using validation data

Finally, we measure the performance of the combined model using the evaluator created above.

In [None]:
# First create predictions by applying the learnt pipeline model to the validation data
pred = # YOUR CODE HERE

# And now calculate the performance metric by using the evaluator on the predictions
rmse = # YOUR CODE HERE

print("RMSE = " + str(rmse))

### Inspect Model

Let us inspect the coefficients of the last stage, which tell us which of the two models (linear or Poisson) carries more weight.

In [None]:
model.stages[-1].coefficients