In a *regression* problem, we aim to predict the output of a continuous value, like a price or a probability. Contrast this with a *classification* problem, where we aim to select a class from a list of classes (for example, where a picture contains an apple or an orange, recognizing which fruit is in the picture).

This notebook uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) Dataset and builds a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.

This example uses the `tf.keras` API, see [this guide](https://www.tensorflow.org/guide/keras) for details.

In [0]:
#!pip install --pre --quiet astroquery

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pathlib
from astroquery.sdss import SDSS

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

1.13.1


## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).

### Get the data
First download the dataset.-

In [0]:
tdata = SDSS.query_sql("""SELECT TOP 1000
                      p.fiberMag_u, p.fiberMag_g, p.fiberMag_r, p.fiberMag_i, 
                      p.fiberMag_z,s.elodieTEff
                   
                      
                      FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
                      
                      WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO' AND s.class !='GALAXY'
                      
                      """)

In [0]:
datapd=data.to_pandas()
print(datapd)

     fiberMag_u  fiberMag_g  fiberMag_r  fiberMag_i  fiberMag_z  elodieTEff
0      21.83404    19.30373    17.88528    16.98804    16.41673        3980
1      19.36734    18.48736    18.15342    18.01115    17.91279        6000
2      19.42088    17.92244    17.13913    16.93484    16.73513        5000
3      19.15921    19.11771    19.36260    19.59598    19.76112        7652
4      18.80860    16.28457    15.33544    15.01370    14.78891        4528
5      24.39023    22.70759    20.93211    19.90295    19.12873        3717
6      17.93162    16.81432    16.49228    16.27050    16.17541        5860
7      16.02642    15.04047    14.88521    14.74759    14.71799        7000
8      20.74338    19.80917    19.62092    19.54036    19.61644        6597
9      21.84291    19.61091    18.44390    17.95710    17.71370        4382
10     18.81515    17.75691    17.59942    17.58388    17.64050        7000
11     17.04001    15.99135    15.63882    15.49512    15.45219        6000
12     20.01

In [0]:
#dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
#dataset_path

Import it using pandas

### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.

In [0]:
train_datapd = datapd.sample(frac=0.8,random_state=0)
test_datapd = datapd.drop(train_datapd.index)


### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

In [0]:
#sns.pairplot(train_datapd[["fiberMag_u", "fiberMag_g", "fiberMag_r", "fiberMag_i","fiberMag_z","elodieTEff"]], diag_kind="kde")

Also look at the overall statistics:

In [0]:

train_stats = train_datapd.describe()
train_stats.pop("elodieTEff")
train_stats = train_stats.transpose()
train_stats


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fiberMag_u,800.0,20.098393,1.895006,14.25133,18.84365,19.755265,21.32248,25.97398
fiberMag_g,800.0,18.723378,1.628038,14.7326,17.67019,18.686685,19.521535,25.53572
fiberMag_r,800.0,18.183431,1.428785,14.88521,17.246405,18.110275,18.857825,23.71967
fiberMag_i,800.0,17.86531,1.287266,14.74759,17.00186,17.76333,18.69831,21.98211
fiberMag_z,800.0,17.707342,1.304314,14.66739,16.766065,17.638395,18.49123,22.25513


### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.

In [0]:
train_labels = train_datapd.pop('elodieTEff')
test_labels = test_datapd.pop('elodieTEff')


### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution that the model has been trained on.

In [0]:
#normaalizar train label
def normlabels(x):
  r=(x-x.mean())/x.std()
  return r
#normalizar train feat
def normfeat(x):
  r=(x-x.mean())/x.std()
  return r

In [0]:

#normtest_temp=tf.keras.utils.normalize(test_labels,axis=0)
train_mean=train_labels.mean()
train_std=train_labels.std()
norm_trainlabels=(train_labels-train_mean)/train_std
print(norm_trainlabels)

993   -0.124885
859   -0.297759
298   -1.150643
553    0.299441
672    0.081233
971   -0.494811
27    -0.329795
231   -1.139763
306   -1.298734
706   -1.213506
496    1.592367
558    2.231878
784    0.081233
239    0.987913
578    0.081233
55     0.383460
906   -0.367876
175   -0.045702
14    -0.045702
77     0.383460
31    -1.298734
481   -0.350951
310   -1.298734
311   -0.487557
883    0.104202
788   -0.045702
45    -0.014875
103    0.987913
760    0.987913
1      0.081233
         ...   
216   -0.142414
735   -1.298734
129    1.592367
349   -0.896772
111    0.081233
166   -0.350951
207   -0.365458
438   -0.914301
552   -1.139763
274   -0.350951
974   -1.213506
591    0.685687
800    0.081233
937    0.987913
525    0.987913
287   -1.298734
469   -0.329795
326   -1.139763
121    1.592367
994    0.987913
507    0.081233
228    0.987913
673   -0.365458
595    0.081233
445    0.104202
117    0.908125
464    0.987913
25    -0.265723
110    1.592367
149   -1.298734
Name: elodieTEff, Length

This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier.  That includes the test set as well as live data when the model is used in production.



## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on.

In [0]:
def build_model():
  model = keras.Sequential([
    layers.Dense(1, activation=tf.nn.sigmoid, input_shape=[len(train_datapd.keys())]),
    #layers.Dense(4, activation=tf.nn.relu),
    #layers.Dense(4, activation=tf.nn.relu),
    #layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
  return model

model = build_model()

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.


### Inspect the model

Use the `.summary` method to print a simple description of the model

In [0]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1)                 6         
Total params: 6
Trainable params: 6
Non-trainable params: 0
_________________________________________________________________


Now try out the model. Take a batch of `10` examples from the training data and call `model.predict` on it.

### Train the model

Train the model for 1000 epochs, and record the training and validation accuracy in the `history` object.

In [0]:
# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 100 == 0: print('')
    print('.', end='')

EPOCHS = 1000

history = model.fit(
  norm_traindata, norm_traintemp,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[PrintDot()])

NameError: ignored

Visualize the model's training progress using the stats stored in the `history` object.

In [0]:
#hist = pd.DataFrame(history.history)
#hist['epoch'] = history.epoch
#hist.tail()

print(history)

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [K]')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.ylim([0,5])
  plt.legend()

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$K^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.ylim([0,20])
  plt.legend()
  plt.show()


plot_history(history)

This graph shows little improvement, or even degradation in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition for  every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.

You can learn more about this callback [here](https://www.tensorflow.org/versions/master/api_docs/python/tf/keras/callbacks/EarlyStopping).

In [0]:
model = build_model()

# The patience parameter is the amount of epochs to check for improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(norm_traindata, train_labels, epochs=EPOCHS,
                    validation_split = 0.2, verbose=0, callbacks=[early_stop, PrintDot()])

plot_history(history)

The graph shows that on the validation set, the average error is usually around +/- 2 MPG. Is this good? We'll leave that decision up to you.

Let's see how well the model generalizes by using the **test** set, which we did not use when training the model.  This tells us how well we can expect the model to predict when we use it in the real world.

In [0]:
loss, mae, mse = model.evaluate(norm_testdata, test_labels, verbose=0)

print("Testing set Mean Abs Error: {:5.2f} Kelvin".format(mae))

### Make predictions

Finally, predict MPG values using data in the testing set:

In [0]:
print(norm_testdata)

test_predictions = model.predict(norm_testdata).flatten()
print(test_predictions)

plt.scatter(test_labels, test_predictions, color='red')
plt.xlabel('True Values [K]')
plt.ylabel('Predictions [K]')
plt.axis('equal')
plt.axis('square')
#plt.xlim([0,plt.xlim()[1]])
#plt.ylim([0,100])
#plt.ylim([0,plt.ylim()[1]])
#_ = plt.plot([-100, 100], [-100, 100])

It looks like our model predicts reasonably well. Let's take a look at the error distribution.

In [0]:
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [K]")
_ = plt.ylabel("Count")

It's not quite gaussian, but we might expect that because the number of samples is very small.



## Conclusion

This notebook introduced a few techniques to handle a regression problem.

* Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems).
* Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).
* When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
* If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.
* Early stopping is a useful technique to prevent overfitting.
