# Linear regression: Predict fuel efficiency

In a regression problem, the aim is to predict a continuous value, such as a price or a probability. Contrast this with a classification problem, where the aim is to select a class from a list of classes (for example, recognizing whether a picture contains an apple or an orange).

This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and builds a model to predict the fuel efficiency of late 1970s and early 1980s automobiles. To do this, the model is given a description of many automobiles from that period. The description includes attributes such as cylinders, displacement, horsepower, and weight.

In this notebook we will predict `mpg` by using **linear regression**.
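As a brief reminder of what linear regression fits, the model can be written as follows. The symbols $\beta_j$, $x_j$, $p$ and $n$ are notation introduced here for illustration and do not appear elsewhere in the notebook: $x_1, \dots, x_p$ stand for the input attributes and $\hat{y}$ for the predicted `mpg`.

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
$$

Ordinary least squares, which is what scikit-learn's `LinearRegression` uses, chooses the intercept $\beta_0$ and the coefficients $\beta_1, \dots, \beta_p$ that minimize the sum of squared residuals over the $n$ training examples:

$$
\min_{\beta_0, \dots, \beta_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$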

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [3]:
#!pip install tensorflow

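# TensorFlow is used in this notebook only for keras.utils.get_file, which downloads the dataset.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers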

In [6]:
# We load the data by using the keras.utils.get_file function
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

column_names = ['mpg','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

df = raw_dataset.copy()
df.tail()

# Show the first few rows to give a feel for the data we are working with.
df.head()

Unnamed: 0,mpg,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1


Our dataset has the following attributes:
- mpg
- cylinders
- displacement
- horsepower
- weight
- acceleration
- model-year
- origin

In [7]:
# Remove rows with unknown values. read_csv above already parses '?' as NaN
# (na_values="?"), so the replace call is only a safeguard.
df = df.replace('?', np.nan)
df = df.dropna()
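The following is a minimal sketch of how the missing-value handling behaves, using a made-up three-row sample rather than the real file. The names `sample`, `toy`, `n_missing` and `cleaned` are hypothetical and used only for this illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same space-separated layout,
# with one unknown horsepower value written as "?".
sample = io.StringIO("18.0 8 130.0\n25.0 4 ?\n21.0 6 86.0\n")

toy = pd.read_csv(sample, names=["mpg", "Cylinders", "Horsepower"],
                  na_values="?", sep=" ", skipinitialspace=True)

# The "?" entry has been parsed as NaN.
n_missing = int(toy["Horsepower"].isna().sum())

# dropna removes every row that contains at least one NaN.
cleaned = toy.dropna()

print(n_missing, len(toy), len(cleaned))
```

Because the `?` marker is converted while reading, `dropna` alone is enough to discard the incomplete row in this sketch.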

In [8]:
# We separate the features (X) from the target variable (y).
X = df.drop('mpg', axis=1)          # mpg is omitted from the features because it is the target variable
y = df[['mpg']]

The `train_test_split` function divides the data set into a training set and a test set according to the `test_size` ratio. In our case, 25% of the data is held out as the test set, and the remaining 75% is used as the training set.
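To make the effect of `test_size` concrete, here is a small sketch on synthetic data (100 made-up rows, not the Auto MPG data). The array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.25 holds out a quarter of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Fixing random_state makes the shuffle reproducible across calls.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

With 100 rows the split yields 75 training rows and 25 test rows, and repeating the call with the same `random_state` returns the same partition.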

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [10]:
# We construct the linear regression model and train it on the training set.
reg = LinearRegression()
reg.fit(X_train, y_train)

In [11]:
print('Free term is {}\n'.format(reg.intercept_[0]))
for idx, col_name in enumerate(X_train.columns):
    print('Coefficient for {} equals {}'.format(col_name, reg.coef_[0][idx]))

Free term is -17.93882554028429

Coefficient for Cylinders equals -0.40160950553132513
Coefficient for Displacement equals 0.015042489927558413
Coefficient for Horsepower equals -0.021494344251586424
Coefficient for Weight equals -0.006055056817090906
Coefficient for Acceleration equals 0.033584574933294684
Coefficient for Model Year equals 0.7625597816194113
Coefficient for Origin equals 1.6215111278712948
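As an illustration of how these numbers are used, the sketch below recomputes a prediction by hand as the intercept plus the weighted sum of the attributes. It copies the intercept and coefficients printed above and applies them to the first row shown in the data preview. The variable names are hypothetical, and this mirrors what `reg.predict` computes for a linear model rather than replacing it.

```python
import numpy as np

# Values copied from the printed output above.
intercept = -17.93882554028429
coef = np.array([
    -0.40160950553132513,   # Cylinders
    0.015042489927558413,   # Displacement
    -0.021494344251586424,  # Horsepower
    -0.006055056817090906,  # Weight
    0.033584574933294684,   # Acceleration
    0.7625597816194113,     # Model Year
    1.6215111278712948,     # Origin
])

# First row of the data preview (recorded mpg for this car: 18.0).
row = np.array([8, 307.0, 130.0, 3504.0, 12.0, 70, 1])

# Linear model: prediction = intercept + sum(coefficient * attribute).
pred = intercept + row @ coef
print(round(float(pred), 2))
```

For this car the hand-computed prediction is roughly 14.9 mpg against a recorded 18.0, an error of the same order as the RMSE reported at the end of the notebook.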


In [12]:
# We evaluate the R^2 metric (coefficient of determination).
r2_test = reg.score(X_test, y_test)
r2_train = reg.score(X_train, y_train)
print('\nR^2 test = {}'.format(r2_test))
print('R^2 train = {}'.format(r2_train))

# Alternative: compute the same scores with r2_score
# r2_test = r2_score(y_test, reg.predict(X_test))
# r2_train = r2_score(y_train, reg.predict(X_train))


R^2 test = 0.7988908872869827
R^2 train = 0.8248209756886098
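For reference, R^2 compares the model's squared error with that of a baseline that always predicts the mean of the target. The sketch below computes it by hand on made-up numbers. `r2_manual` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, not taken from the dataset.
y_true = [18.0, 15.0, 24.0, 30.0]
y_pred = [17.0, 16.0, 25.0, 28.0]

r2_demo = r2_manual(y_true, y_pred)
r2_perfect = r2_manual(y_true, y_true)        # perfect predictions
r2_mean = r2_manual(y_true, [21.75] * 4)      # always predicting the mean
print(r2_demo, r2_perfect, r2_mean)
```

Perfect predictions give 1 and the mean-only baseline gives 0, so a test score of about 0.80 means the model removes roughly 80% of the baseline's squared error.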


In [13]:
# We calculate the mean squared error
mse_test = mean_squared_error(y_test, reg.predict(X_test))
mse_train = mean_squared_error(y_train, reg.predict(X_train))
print('\nMSE test = {}'.format(mse_test))
print('MSE train = {}'.format(mse_train))


MSE test = 10.14255845492744
MSE train = 11.205041479255197


In [14]:
# By how many miles per gallon are we off on average?
# We take the square root of the MSE (root-mean-square error) to get back to the units of the target variable (mpg).
rmse_test = np.sqrt(mse_test)
rmse_train = np.sqrt(mse_train)
print('\nRMSE test = {}'.format(rmse_test))
print('RMSE train = {}'.format(rmse_train))


RMSE test = 3.184738365223655
RMSE train = 3.34739323642371
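As a quick cross-check of how the two error metrics relate, the sketch below computes MSE and RMSE by hand on made-up numbers, then confirms that the square root of the test MSE printed above reproduces the reported test RMSE. All variable names are hypothetical.

```python
import numpy as np

# Made-up targets and predictions, not taken from the dataset.
y_true = np.array([18.0, 15.0, 24.0, 30.0])
y_pred = np.array([17.0, 16.0, 25.0, 28.0])

# MSE is the mean of the squared residuals; RMSE is its square root.
mse_demo = np.mean((y_true - y_pred) ** 2)
rmse_demo = np.sqrt(mse_demo)
print(mse_demo, rmse_demo)

# Consistency check against the values printed in the notebook output.
reported_mse_test = 10.14255845492744
reported_rmse_test = 3.184738365223655
print(np.isclose(np.sqrt(reported_mse_test), reported_rmse_test))
```

Because RMSE is in the same units as the target, the test value of about 3.18 reads directly as a typical error of about 3.2 mpg.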
