# Regression MPG With Analytics Zoo

This notebook demonstrates a regression example built with the Analytics Zoo Keras-style API.

In a regression problem, we aim to predict a continuous value, such as a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, given a picture that contains an apple or an orange, recognizing which fruit is in the picture).

This notebook uses the classic Auto MPG dataset and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we provide the model with a description of many automobiles from that time period. This description includes attributes such as cylinders, displacement, horsepower, and weight.

## Initialization

* import necessary libraries

In [2]:
import pandas as pd

In [9]:
import os

import zoo.common.nncontext
sc = zoo.common.nncontext.init_nncontext("MPG Regression")
import pandas as pd
import numpy as np
import matplotlib
# matplotlib.use() must be called before pyplot is imported
matplotlib.use('Agg')
import matplotlib.pyplot as plt
%pylab inline
import seaborn as sns


Populating the interactive namespace from numpy and matplotlib



* import necessary modules

In [10]:
from zoo.pipeline.api.keras.layers import Dense, Dropout
from zoo.pipeline.api.keras.models import Sequential

## Data Check

### About the Data

The Auto MPG dataset is available from the UCI Machine Learning Repository.

## Get the data

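First, download the dataset and load it into a pandas DataFrame. Unknown values are marked with `?` in the file, so they are read as missing values and the rows that contain them are dropped.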

In [12]:
dataset_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin'] 
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset = dataset.dropna()
dataset.tail()

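|     | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|-----|------|-----------|--------------|------------|--------|--------------|------------|--------|
| 393 | 27.0 | 4         | 140.0        | 86.0       | 2790.0 | 15.6         | 82         | 1      |
| 394 | 44.0 | 4         | 97.0         | 52.0       | 2130.0 | 24.6         | 82         | 2      |
| 395 | 32.0 | 4         | 135.0        | 84.0       | 2295.0 | 11.6         | 82         | 1      |
| 396 | 28.0 | 4         | 120.0        | 79.0       | 2625.0 | 18.6         | 82         | 1      |
| 397 | 31.0 | 4         | 119.0        | 82.0       | 2720.0 | 19.4         | 82         | 1      |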
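
To see what the `read_csv` options do without downloading anything, the following sketch parses a few inline rows written in the same layout as the data file. The rows are illustrative samples, and the names `sample`, `missing_per_column`, and `cleaned` are introduced here only for the example. It shows that `comment='\t'` cuts off the trailing car name and that `na_values="?"` turns the `?` marker into a missing value.

```python
import io

import pandas as pd

# Illustrative rows in the layout of auto-mpg.data: space-separated numbers,
# then a tab and the quoted car name.
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
    '24.0   4   113.0      95.00      2372.      15.0   70  3\t"toyota corona mark ii"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw = pd.read_csv(io.StringIO(sample), names=column_names,
                  na_values="?", comment='\t',
                  sep=" ", skipinitialspace=True)

# Count missing values per column before dropping incomplete rows.
missing_per_column = raw.isna().sum()
cleaned = raw.dropna()

print(missing_per_column)
print(len(raw), "rows read,", len(cleaned), "rows kept")
```

Checking `isna().sum()` before calling `dropna()` makes it visible how many rows are lost and in which column.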


In [13]:
origin = dataset.pop('Origin')
dataset['USA'] = (origin == 1)*1.0
dataset['Europe'] = (origin == 2)*1.0
dataset['Japan'] = (origin == 3)*1.0
dataset.tail()

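|     | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | USA | Europe | Japan |
|-----|------|-----------|--------------|------------|--------|--------------|------------|-----|--------|-------|
| 393 | 27.0 | 4         | 140.0        | 86.0       | 2790.0 | 15.6         | 82         | 1.0 | 0.0    | 0.0   |
| 394 | 44.0 | 4         | 97.0         | 52.0       | 2130.0 | 24.6         | 82         | 0.0 | 1.0    | 0.0   |
| 395 | 32.0 | 4         | 135.0        | 84.0       | 2295.0 | 11.6         | 82         | 1.0 | 0.0    | 0.0   |
| 396 | 28.0 | 4         | 120.0        | 79.0       | 2625.0 | 18.6         | 82         | 1.0 | 0.0    | 0.0   |
| 397 | 31.0 | 4         | 119.0        | 82.0       | 2720.0 | 19.4         | 82         | 1.0 | 0.0    | 0.0   |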
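
The cell above one-hot encodes the `Origin` column by hand. As a possible alternative, the same encoding can be sketched with `pd.get_dummies`, using the code-to-name mapping from the cell (1 = USA, 2 = Europe, 3 = Japan). The small `origin` series below is made up for illustration.

```python
import pandas as pd

# Made-up origin codes; the mapping 1=USA, 2=Europe, 3=Japan follows the cell above.
origin = pd.Series([1, 2, 3, 1], name="Origin")
names = origin.map({1: "USA", 2: "Europe", 3: "Japan"})

# One column per category, with 1.0 where the row belongs to that category.
encoded = pd.get_dummies(names, dtype=float)
print(encoded)
```

Note that `get_dummies` orders the new columns alphabetically, so the column order differs from the manual version.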


In [14]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
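
The split works because `sample` keeps the original row index, so dropping the sampled index leaves exactly the remaining rows. A minimal sketch on a made-up ten-row frame (the values and the name `df` are only for illustration):

```python
import pandas as pd

# Made-up frame with ten rows.
df = pd.DataFrame({"MPG": [18.0, 15.0, 36.0, 27.0, 44.0,
                           32.0, 28.0, 31.0, 24.0, 22.0]})

# Sample 80% of the rows for training; a fixed random_state makes it repeatable.
train = df.sample(frac=0.8, random_state=0)
# Everything that was not sampled becomes the test set.
test = df.drop(train.index)

print(len(train), len(test))
print(set(train.index) & set(test.index))  # empty set: no row is in both
```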

In [15]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

In [16]:
x_train = train_dataset.values
y_train = train_labels.values
x_test = test_dataset.values
y_test = test_labels.values

## Data Preparation

In [17]:

# see the shape
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)

x_train (314, 9)
y_train (314,)
x_test (78, 9)
y_test (78,)
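
The feature columns span very different ranges (weight is in the thousands, cylinders is a single digit). This notebook feeds the raw values to the model. One common option, not used here, is to standardize each feature with statistics computed on the training set only. The sketch below uses made-up values, and `standardize` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def standardize(train_x, test_x):
    """Scale both arrays using the mean and std of the training data only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train_x - mean) / std, (test_x - mean) / std

# Made-up rows with two features: cylinders and weight.
toy_train = np.array([[4.0, 2130.0], [8.0, 3504.0], [6.0, 2790.0]])
toy_test = np.array([[4.0, 2295.0]])

toy_train_scaled, toy_test_scaled = standardize(toy_train, toy_test)
print(toy_train_scaled.mean(axis=0))  # close to 0 for every column
print(toy_train_scaled.std(axis=0))   # close to 1 for every column
```

Using the training statistics for the test set as well keeps information about the test data out of the preprocessing step.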


## Build Model

We are going to build a neural network using the Keras-style API:

 * Dense layer (64 neurons), with dropout
 * Dense layer (64 neurons), with dropout
 * Output layer (1 neuron)
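
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes 9 input features and that every dense layer has a bias term; dropout layers add no parameters. `dense_params` is a hypothetical helper used only for this calculation.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# (inputs, outputs) for the three dense layers listed above.
layers = [(9, 64), (64, 64), (64, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(counts)

print(counts)  # [640, 4160, 65]
print(total)   # 4865
```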
 


In [18]:
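def build_model(train_x):
  model = Sequential()

  # the input dimension is the number of feature columns (9 here)
  model.add(Dense(64, input_dim=train_x.shape[1]))
  model.add(Dropout(0.2))
  model.add(Dense(64))
  model.add(Dropout(0.2))
  # single output neuron for the predicted MPG
  model.add(Dense(output_dim=1))

  # mean squared error loss with the RMSprop optimizer
  model.compile(loss='mse', optimizer='rmsprop')
  return model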

model = build_model(x_train)

creating: createZooKerasSequential
creating: createZooKerasDense
creating: createZooKerasDropout
creating: createZooKerasDense
creating: createZooKerasDropout
creating: createZooKerasDense
creating: createRMSprop
creating: createZooKerasMeanSquaredError


## Train the model

In [19]:
%%time
# Train the model
print("Training begins.")
model.fit(
    x_train,
    y_train,
    batch_size=20,
    nb_epoch=20)
print("Training completed.")

Training begins.
Training completed.
CPU times: user 210 ms, sys: 50 ms, total: 260 ms
Wall time: 49.1 s


## Prediction

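* The model makes inferences on the given data with the `model.predict` API. The result is returned as an RDD, so `collect()` is used below to bring the predictions back as a list. `predict_class` returns predicted class labels for classification models and is not needed for this regression task.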

In [21]:
predictions = model.predict(x_test)

In [22]:
# build lists of the absolute and relative differences between predictions and test labels
diff=[]
ratio=[]
predictions = model.predict(x_test)
p = predictions.collect()
predictions_list=[]
for u in range(len(y_test)):
    pr = p[u][0]
    predictions_list.append(pr)
    ratio.append((y_test[u]/pr)-1)
    diff.append(abs(y_test[u]- pr))
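
The loop above can also be written with NumPy array operations. The sketch below uses stand-in data instead of a real `collect()` result: it assumes, as the indexing `p[u][0]` suggests, that each collected prediction is a length-1 array. All values are made up.

```python
import numpy as np

# Stand-in for predictions.collect(): one length-1 array per test row (assumed shape).
p = [np.array([20.0]), np.array([30.0]), np.array([25.0])]
y_test = np.array([22.0, 27.0, 25.0])  # made-up labels

# Flatten the collected predictions into a 1-D array.
pred = np.array(p).reshape(-1)

diff = np.abs(y_test - pred)   # absolute error per row
ratio = y_test / pred - 1      # relative deviation of the label from the prediction

print(diff)
print(ratio)
```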

In [14]:
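# sqrt(square(diff)) is just |diff|, so this averages the absolute errors:
# the value is the mean absolute error (MAE), not the RMSE
MAE = np.mean(np.abs(diff))
print("MAE = " + str(MAE))

MAE = 9.133555566347562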
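
The order of operations matters for this metric. The root-mean-square error (RMSE) takes the square root after averaging the squared errors, while averaging the absolute errors gives the mean absolute error (MAE). A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up labels and predictions.
y_true = np.array([3.0, 5.0])
y_pred = np.array([1.0, 5.0])
errors = y_true - y_pred

mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # root-mean-square error

print("MAE  =", mae)
print("RMSE =", rmse)
```

RMSE is never smaller than MAE on the same errors, and it penalizes large errors more heavily.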
