# MPG Regression With Analytics Zoo

This notebook demonstrates a regression example built with the Analytics Zoo Keras-style API.

In a regression problem, we aim to predict a continuous value, such as a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, recognizing whether a picture contains an apple or an orange).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late 1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.

## Initialization

We first initialize the Analytics Zoo context, which returns the SparkContext that we will use to interact with Apache Spark.

In [14]:

import zoo.common.nncontext
sc = zoo.common.nncontext.init_nncontext("MPG Regression")


## About the Data

The Auto MPG dataset is available from the UCI Machine Learning Repository. It describes various properties of cars, which we will use to predict each car's MPG.

## Get the data

First, download the dataset. Missing values are marked with `?` in the raw file and are read in as NA.

We then drop every row that contains an NA (empty) value.

In [16]:
import pandas as pd
dataset_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin'] 
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset = dataset.dropna()  # Drop any NA values -- we don't want to deal with rows with NA
dataset.tail()

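| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |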
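
To illustrate how the `na_values="?"` option and `dropna()` work together, here is a minimal sketch on a few made-up rows. The rows, the reduced column list, and the variable names are assumptions for illustration only. They are not taken from the real file.

```python
import io
import pandas as pd

# Made-up rows in the same space-separated layout; "?" marks a missing value.
sample_text = (
    "18.0 8 307.0 130.0\n"
    "25.0 4 98.0 ?\n"
    "22.0 6 198.0 95.0\n"
)
sample_columns = ["MPG", "Cylinders", "Displacement", "Horsepower"]

sample = pd.read_csv(io.StringIO(sample_text), names=sample_columns,
                     na_values="?", sep=" ", skipinitialspace=True)

rows_with_na = int(sample.isna().any(axis=1).sum())  # rows with at least one NA
cleaned = sample.dropna()                            # keep only complete rows

print("rows before:", len(sample))
print("rows with NA:", rows_with_na)
print("rows after:", len(cleaned))
```

Counting the affected rows before dropping them is a cheap way to check how much data the cleanup step discards.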


### One-Hot Encoding

The Origin variable is a categorical variable which is pretending to be a numeric variable.  To get better results, we should ensure that it is properly treated as a categorical variable.  Given that the cardinality (number of possible values) is low (3), we can make a one-hot-encoded version of it -- making 3 variables instead of one.

The nice thing about one-hot-encoded variables is that they are treated independently, so the model will make no assumptions about the ordinal nature of the number.

In [17]:
origin = dataset.pop('Origin')
dataset['USA'] = (origin == 1)*1.0
dataset['Europe'] = (origin == 2)*1.0
dataset['Japan'] = (origin == 3)*1.0
dataset.tail()

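| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | USA | Europe | Japan |
|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1.0 | 0.0 | 0.0 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 0.0 | 1.0 | 0.0 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1.0 | 0.0 | 0.0 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1.0 | 0.0 | 0.0 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1.0 | 0.0 | 0.0 |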
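
As an alternative to building the three columns by hand, pandas can generate one-hot columns with `pd.get_dummies`. The sketch below is not part of the original notebook. It uses made-up rows and assumes the same code-to-region mapping as above (1 = USA, 2 = Europe, 3 = Japan).

```python
import pandas as pd

# Made-up rows; the Origin codes follow the mapping assumed above.
toy = pd.DataFrame({"MPG": [27.0, 44.0, 32.0, 24.0], "Origin": [1, 2, 1, 3]})

# Replace the numeric codes with readable labels, then expand them
# into one indicator column per label.
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})
encoded = pd.get_dummies(toy, columns=["Origin"], dtype=float)

print(encoded)
```

Each row has exactly one indicator set to 1.0, so the model sees three independent flags rather than a single number with an implied order.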


## Partition into Training and Test

Here we will partition our dataset into training and test.  It is set for 80% training and 20% test.

We usually split up our dataset into training and test so we can get a fair evaluation on our model for data it has not yet seen.

In [18]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
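
The `sample` / `drop` pattern above can be checked on a small made-up frame. This sketch (the toy data and names are assumptions for illustration) shows that the two partitions have the expected sizes and share no rows.

```python
import pandas as pd

# Made-up frame with 10 rows, standing in for the real dataset.
toy = pd.DataFrame({"MPG": range(10, 20), "Weight": range(2000, 3000, 100)})

train = toy.sample(frac=0.8, random_state=0)  # random 80% of the rows
test = toy.drop(train.index)                  # every row that was not sampled

overlap = set(train.index) & set(test.index)  # should be empty
print(len(train), len(test), len(overlap))    # 8 2 0
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same held-out rows.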

## Get Labels

Here we will remove MPG from our dataset as it is our "label" -- the ground truth value we are trying to predict, and set it to our train_labels and test_labels.

In [19]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

In [20]:
x_train = train_dataset.values
y_train = train_labels.values
x_test = test_dataset.values
y_test = test_labels.values

## Data Preparation

Here we check the shape of our data.

In [21]:

# see the shape
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)

x_train (314, 9)
y_train (314,)
x_test (78, 9)
y_test (78,)


## Build Model

We are going to build a neural network using the Keras-style API:

 * Dense Layer (64 Neurons), with Dropout
 * Dense Layer (64 Neurons), with Dropout
 * Output Layer (1 Neuron)
 
![](../assets/images/network-01.png)

We will use MSE (mean squared error) as our loss function, a common choice for regression, and RMSProp, a variant of gradient descent, as our optimizer.
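
For reference, the mean squared error can be written as follows. The symbols are assumptions introduced here for illustration: $n$ is the number of examples, $y_i$ is the true MPG of example $i$, and $\hat{y}_i$ is the model's prediction for it.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Because each difference is squared, a few large errors contribute far more to the loss than many small ones.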


In [23]:
from zoo.pipeline.api.keras.layers import Dense, Dropout
from zoo.pipeline.api.keras.models import Sequential

def build_model(train_x):
  model = Sequential() # Sequential model: the output of each layer feeds the next

  # Take the input width from the data instead of hard-coding 9
  model.add(Dense(64, input_dim=train_x.shape[1], activation='tanh'))
  model.add(Dropout(0.2))  # 20% dropout to reduce overfitting.
  model.add(Dense(64, activation='tanh')) 
  model.add(Dropout(0.2))  # 20% dropout to reduce overfitting.
  model.add(Dense(output_dim=1, activation='linear'))
  
  model.compile(loss='mse', optimizer='rmsprop') # MSE for a regression problem
  return model

model = build_model(x_train)

creating: createZooKerasSequential
creating: createZooKerasDense
creating: createZooKerasDropout
creating: createZooKerasDense
creating: createZooKerasDropout
creating: createZooKerasDense
creating: createRMSprop
creating: createZooKerasMeanSquaredError
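
To give an intuition for what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout. It only illustrates the general technique and is not the Analytics Zoo implementation. The function name and arguments are hypothetical.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 5))
dropped = inverted_dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

With a rate of 0.2, roughly one in five activations is zeroed on each pass, and the rest are scaled by 1 / 0.8 so that the expected total stays the same. Dropout is normally active only during training.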


## Train the model

We will use `model.fit` in order to train our model.  We limit training to 20 epochs in order to ensure the notebook doesn't run too long, but more epochs could yield better results.

In [24]:
%%time
# Train the model
print("Training begins.")
model.fit(
    x_train,
    y_train,
    batch_size=16,  # Powers of 2 make good batch sizes
    nb_epoch=20)
print("Training completed.")

Training begins.
Training completed.
CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 25.5 s


## Prediction

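Analytics Zoo models make predictions with the `model.predict` API. Here we pass in the test features `x_test`. The result is returned as an RDD, so it must be collected before it can be used locally. The related `predict_class` call is meant for classification models and is not used here.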

In [27]:
predictions = model.predict(x_test)

## Calculate the Error

We will calculate the error, which is simply the absolute difference between each predicted value and the ground truth. We then take the mean (average) of these absolute differences, which gives the mean absolute error (MAE). This is not the same as the root mean squared error (RMSE), which is the square root of the mean of the squared differences.

In [28]:
# Collect the predictions and compare each one with its test label
diff=[]
ratio=[]
predictions = model.predict(x_test)
p = predictions.collect()
predictions_list=[]
for u in range(len(y_test)):
    pr = p[u][0]
    predictions_list.append(pr)
    ratio.append((y_test[u]/pr)-1)   # relative error
    diff.append(abs(y_test[u]- pr))  # absolute error

In [29]:
import numpy as np
MAE = np.mean(diff)  # diff already holds absolute errors
print ("MAE = " + str(MAE))

MAE = 6.9449465825007515
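
To make the difference between common regression error summaries concrete, here is a small sketch on made-up numbers (not taken from the model above). Mean absolute error (MAE) averages the absolute differences, while root mean squared error (RMSE) takes the square root of the mean squared difference.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([20.0, 30.0, 25.0, 35.0])
y_pred = np.array([22.0, 27.0, 25.0, 41.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))     # mean absolute error
mse = np.mean(np.square(errors))  # mean squared error
rmse = np.sqrt(mse)               # root mean squared error

print("MAE  =", mae)   # 2.75
print("MSE  =", mse)   # 12.25
print("RMSE =", rmse)  # 3.5
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, as with the last value here.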


### Conclusion

The final error is around 7, meaning that on average our predictions are off by about 7 MPG. This is a reasonable first result for a small network trained for only 20 epochs. We could try to improve it in the following ways:

 * Run for more epochs
 * Experiment with different batch sizes
 * Try different optimizers (Adam, SGD)
 * Try different loss functions.
 
### Lessons Learned

 * The Keras-style API can be a very effective API for Zoo, and one that many users will already be familiar with.
 * Categorical variables can be handled by one-hot-encoding, especially if their cardinality is small.
 * Dropout is effective at combatting overfitting.
 * MSE (Mean-Squared Error) is a common loss function for regression problems.