# Auto MPG Prediction using PyTorch

This notebook is a short example of using PyTorch to build a small neural network that predicts the fuel efficiency of late-1970s and early-1980s automobiles. The dataset comes from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg).

The dataset contains 398 rows and 9 columns. Each row is a car and each column is a feature of the car. The columns are as follows:

1. mpg: miles per gallon
2. cylinders: number of cylinders
3. displacement: engine displacement in cubic inches
4. horsepower: engine horsepower
5. weight: vehicle weight
6. acceleration: time to accelerate from 0 to 60 mph
7. model year: model year
8. origin: origin of car
9. car name: name of car

We will use columns 2 through 8 (cylinders through origin) as input features to predict mpg, the target. The car name is not used.

## Installing Required Libraries

In [1]:
!pip install pandas==2.0.3 torch==2.2.0 scikit-learn==1.3.2



## Loading the data

In [2]:
import pandas as pd


url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement',
                'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']


df = pd.read_csv(url, names=column_names, na_values="?",
                 comment='\t', sep=" ", skipinitialspace=True)
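The `read_csv` arguments above can be easier to follow on a tiny in-memory sample. The sketch below assumes the raw file uses space-padded numeric fields followed by a tab and the quoted car name; the two records and the `sample_df` name are illustrative, not rows taken from the dataset.

```python
import io
import pandas as pd

# Two illustrative records (hypothetical values) in the assumed file layout:
# space-padded numeric fields, then a tab and the quoted car name.
sample = (
    '20.0   6   200.0      100.0      3000.      15.0   75  1\t"car a"\n'
    '30.0   4   100.0      ?          2000.      18.0   80  2\t"car b"\n'
)
column_names = ['MPG', 'Cylinders', 'Displacement',
                'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']

# comment='\t' discards everything from the tab onward (the car name),
# na_values="?" turns the "?" placeholder into NaN,
# skipinitialspace=True collapses the runs of padding spaces.
sample_df = pd.read_csv(io.StringIO(sample), names=column_names, na_values="?",
                        comment='\t', sep=" ", skipinitialspace=True)

print(sample_df.shape)
print(sample_df['Horsepower'].isna().tolist())
```

This is why only eight column names are passed: the ninth column, the car name, is treated as a comment and never reaches the DataFrame.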

## Preprocessing the data

### Drop the missing values

In [3]:
df = df.dropna()
df = df.reset_index(drop=True)

### Split the data into train and test

In [4]:
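from sklearn import model_selection

# Hold out 20% of the rows for testing; fix the seed for reproducibility
df_train, df_test = model_selection.train_test_split(
    df, train_size=0.8, random_state=1)

# Summary statistics of the training split, reused below for normalization
train_status = df_train.describe().transpose()
train_status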

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MPG | 313.0 | 23.404153 | 7.666909 | 9.0 | 17.5 | 23.0 | 29.0 | 46.6 |
| Cylinders | 313.0 | 5.402556 | 1.701506 | 3.0 | 4.0 | 4.0 | 8.0 | 8.0 |
| Displacement | 313.0 | 189.51278 | 102.675646 | 68.0 | 104.0 | 140.0 | 260.0 | 455.0 |
| Horsepower | 313.0 | 102.929712 | 37.919046 | 46.0 | 75.0 | 92.0 | 120.0 | 230.0 |
| Weight | 313.0 | 2961.198083 | 848.602146 | 1613.0 | 2219.0 | 2755.0 | 3574.0 | 5140.0 |
| Acceleration | 313.0 | 15.704473 | 2.725399 | 8.5 | 14.0 | 15.5 | 17.3 | 24.8 |
| Model Year | 313.0 | 75.929712 | 3.675305 | 70.0 | 73.0 | 76.0 | 79.0 | 82.0 |
| Origin | 313.0 | 1.591054 | 0.807923 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |


### Normalize the numerical columns

In [5]:
print('before normalization:\n', df_train.head())
numeric_column_names = ['Cylinders', 'Displacement',
                        'Horsepower', 'Weight', 'Acceleration']

df_train_norm, df_test_norm = df_train.copy(), df_test.copy()

for col_name in numeric_column_names:
    # Use training-set statistics for both splits so that no
    # information from the test set leaks into preprocessing
    mean = train_status.loc[col_name, 'mean']
    std = train_status.loc[col_name, 'std']

    df_train_norm.loc[:, col_name] = (
        df_train_norm.loc[:, col_name] - mean) / std

    df_test_norm.loc[:, col_name] = (
        df_test_norm.loc[:, col_name] - mean) / std
print()
print('\nafter normalization:\n', df_train_norm.head())


before normalization:
       MPG  Cylinders  Displacement  Horsepower  Weight  Acceleration  \
334  27.2          4         135.0        84.0  2490.0          15.7   
258  18.6          6         225.0       110.0  3620.0          18.7   
139  29.0          4          98.0        83.0  2219.0          16.5   
310  37.2          4          86.0        65.0  2019.0          16.4   
349  33.0          4         105.0        74.0  2190.0          14.2   

     Model Year  Origin  
334          81       1  
258          78       1  
139          74       2  
310          80       3  
349          81       2  


after normalization:
       MPG  Cylinders  Displacement  Horsepower    Weight  Acceleration  \
334  27.2  -0.824303     -0.530922   -0.499214 -0.555264     -0.001641   
258  18.6   0.351127      0.345625    0.186457  0.776338      1.099115   
139  29.0  -0.824303     -0.891280   -0.525586 -0.874613      0.291894   
310  37.2  -0.824303     -1.008153   -1.000281 -1.110294      0.2552
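The normalization step is a z-score computed from training-set statistics only. A minimal sketch on made-up weights (the `train`, `test`, `train_norm`, and `test_norm` names and all values are hypothetical) shows the arithmetic:

```python
import pandas as pd

# Hypothetical toy splits, chosen so the statistics come out as round numbers
train = pd.DataFrame({'Weight': [2000.0, 3000.0, 4000.0]})
test = pd.DataFrame({'Weight': [3500.0]})

mean = train['Weight'].mean()  # 3000.0
std = train['Weight'].std()    # sample standard deviation (ddof=1), same as describe(): 1000.0

# Both splits are scaled with the *training* mean and std
train_norm = (train['Weight'] - mean) / std
test_norm = (test['Weight'] - mean) / std

print(train_norm.tolist())  # [-1.0, 0.0, 1.0]
print(test_norm.tolist())   # [0.5]
```

The test value is expressed in units of training standard deviations from the training mean, which is the scale the model sees during training.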

### Bucketizing

In [6]:
import torch

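# Bucket edges for Model Year. With right=True the buckets are:
# 0: year < 73, 1: 73 <= year < 76, 2: 76 <= year < 79, 3: year >= 79
boundaries = torch.tensor([73, 76, 79])
v = torch.tensor(df_train_norm['Model Year'].values)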

df_train_norm['Model Year Bucketed'] = torch.bucketize(
    v, boundaries, right=True)

v = torch.tensor(df_test_norm['Model Year'].values)
df_test_norm['Model Year Bucketed'] = torch.bucketize(v, boundaries, right=True)

numeric_column_names.append('Model Year Bucketed')
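To see what `torch.bucketize` does with these boundaries, the following sketch applies it to a handful of hand-picked years (the `years` tensor is illustrative, not taken from the dataset):

```python
import torch

boundaries = torch.tensor([73, 76, 79])
# Hand-picked model years, including values that sit exactly on a boundary
years = torch.tensor([70, 72, 73, 75, 76, 78, 79, 82])

# right=True: a value equal to a boundary falls into the bucket above it
buckets = torch.bucketize(years, boundaries, right=True)
print(buckets.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the bucket index is appended to the numeric columns, the model treats it as a single ordinal feature with values 0 to 3 rather than as a categorical one.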

### One-Hot Encoding

In [7]:
from torch.nn.functional import one_hot

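# Origin takes the values 1, 2, 3; the modulo maps them to class
# indices 1, 2, 0 so they fit the 0-based range one_hot expects
total_origin = len(set(df_train_norm['Origin']))

origin_encoded = one_hot(torch.from_numpy(
    df_train_norm['Origin'].values) % total_origin)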

x_train_numeric = torch.tensor(df_train_norm[numeric_column_names].values)

x_train = torch.cat([x_train_numeric, origin_encoded], 1).float()

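# Fix the number of classes so the test encoding always has the same
# width as the training encoding, even if a class were absent here
origin_encoded = one_hot(torch.from_numpy(
    df_test_norm['Origin'].values) % total_origin, num_classes=total_origin)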

x_test_numeric = torch.tensor(df_test_norm[numeric_column_names].values)

x_test = torch.cat([x_test_numeric, origin_encoded], 1).float()
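The modulo trick in the encoding step can be checked on a few hand-written origin codes. This is a small sketch with an illustrative `origin` tensor; passing `num_classes` explicitly is a suggestion here, so the output width does not depend on which classes happen to be present.

```python
import torch
from torch.nn.functional import one_hot

# Illustrative origin codes in the dataset's 1..3 range
origin = torch.tensor([1, 2, 3, 1])
total_origin = 3

# 1 -> 1, 2 -> 2, 3 -> 0: every code lands in the 0-based range [0, 3)
class_idx = origin % total_origin
encoded = one_hot(class_idx, num_classes=total_origin)

print(class_idx.tolist())  # [1, 2, 0, 1]
print(encoded.tolist())    # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Without `num_classes`, `one_hot` infers the width from the largest index in its input, so a split that lacks the highest index would produce fewer columns.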

### Create Y label (Ground Truth)

In [8]:
y_train = torch.tensor(df_train_norm['MPG'].values).float()
y_test = torch.tensor(df_test_norm['MPG'].values).float()

## Training

In [9]:
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(x_train, y_train)
batch_size = 8

torch.manual_seed(1)

train_dl = DataLoader(train_ds, batch_size, shuffle=True)
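As a quick sanity check on the batching, the sketch below builds a loader over placeholder tensors with the same shape as the training data (313 rows per the summary table, 9 features per the model printout) and counts the batches. The zero-filled tensors and the `dummy_*` names are stand-ins, not the real data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors with the training set's shape: 313 samples, 9 features
dummy_x = torch.zeros(313, 9)
dummy_y = torch.zeros(313)

dummy_dl = DataLoader(TensorDataset(dummy_x, dummy_y), batch_size=8, shuffle=True)

# 313 = 39 * 8 + 1, so there are 39 full batches and one batch of size 1
batch_sizes = [len(x_batch) for x_batch, _ in dummy_dl]
print(len(dummy_dl), batch_sizes[-1])  # 40 1
```

The final batch holds a single sample. If that is undesirable, `DataLoader` accepts `drop_last=True` to skip incomplete batches.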

In [10]:
from torch import nn
hidden_units = [8, 4]
input_size = x_train.shape[1]

all_layers = []

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit

all_layers.append(nn.Linear(hidden_units[-1], 1))

model = nn.Sequential(*all_layers)
model

Sequential(
  (0): Linear(in_features=9, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=4, bias=True)
  (3): ReLU()
  (4): Linear(in_features=4, out_features=1, bias=True)
)
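The printed architecture is small enough to count its trainable parameters by hand. The sketch below rebuilds the same layer sizes directly (under the hypothetical name `sketch_model`) and checks the total:

```python
from torch import nn

# Same layer sizes as the printed model, written out explicitly
sketch_model = nn.Sequential(
    nn.Linear(9, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

# Weights plus biases per layer: (9*8 + 8) + (8*4 + 4) + (4*1 + 1) = 80 + 36 + 5
n_params = sum(p.numel() for p in sketch_model.parameters())
print(n_params)  # 121
```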

In [11]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [12]:
torch.manual_seed(1)
num_epochs = 200
log_epochs = 20

for epoch in range(num_epochs):
    loss_train_hist = 0

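    for x_batch, y_batch in train_dl:
        # Forward pass; [:, 0] flattens the (batch, 1) output to (batch,)
        pred = model(x_batch)[:, 0]
        loss = loss_fn(pred, y_batch)

        # Backward pass, parameter update, then clear gradients for the next batch
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Accumulate the batch-mean loss for the epoch average
        loss_train_hist += loss.item()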

    if epoch % log_epochs == 0:
        print(f'Epoch {epoch} \tLoss\t{loss_train_hist/len(train_dl):.4f}')


Epoch 0 	Loss	536.1047
Epoch 20 	Loss	8.4361
Epoch 40 	Loss	7.8695
Epoch 60 	Loss	7.1891
Epoch 80 	Loss	6.7064
Epoch 100 	Loss	6.7603
Epoch 120 	Loss	6.3107
Epoch 140 	Loss	6.6884
Epoch 160 	Loss	6.7549
Epoch 180 	Loss	6.2029


In [13]:
with torch.no_grad():
    pred = model(x_test.float())[:, 0]
    loss = loss_fn(pred, y_test)

    print(f'Test MSE:\t{loss.item():.4f}')
    print(f'Test MAE:\t{nn.L1Loss()(pred, y_test).item():.4f}')

Test MSE:	9.5907
Test MAE:	2.1177
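To relate the two reported metrics, here is a small sketch on made-up predictions (the `toy_pred` and `toy_target` tensors are hypothetical). It also computes the root of the MSE, which is in the same units as MPG.

```python
import torch
from torch import nn

# Hypothetical predictions and ground-truth MPG values
toy_pred = torch.tensor([20.0, 30.0, 25.0])
toy_target = torch.tensor([22.0, 29.0, 25.0])

mse = nn.MSELoss()(toy_pred, toy_target)  # mean of squared errors: (4 + 1 + 0) / 3
mae = nn.L1Loss()(toy_pred, toy_target)   # mean of absolute errors: (2 + 1 + 0) / 3
rmse = torch.sqrt(mse)                    # back in MPG units

print(f'MSE: {mse.item():.4f}')    # MSE: 1.6667
print(f'MAE: {mae.item():.4f}')    # MAE: 1.0000
print(f'RMSE: {rmse.item():.4f}')  # RMSE: 1.2910
```

Applied to the test MSE reported above, the square root of 9.5907 is about 3.1 MPG, compared with an MAE of about 2.1 MPG. Squaring weights large errors more heavily, so the root of the MSE is never smaller than the MAE.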
