Import Libraries

In [2]:
import numpy as np
import pandas as pd

Load Dataset

In [3]:
possum = pd.read_csv("datasets/possum.csv")
possum.head()

Unnamed: 0,case,site,Pop,sex,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,1,1,Vic,m,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,2,1,Vic,f,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
2,3,1,Vic,f,6.0,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
3,4,1,Vic,f,6.0,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
4,5,1,Vic,f,2.0,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0


Inspect the dataset

In [4]:
possum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   case      104 non-null    int64  
 1   site      104 non-null    int64  
 2   Pop       104 non-null    object 
 3   sex       104 non-null    object 
 4   age       102 non-null    float64
 5   hdlngth   104 non-null    float64
 6   skullw    104 non-null    float64
 7   totlngth  104 non-null    float64
 8   taill     104 non-null    float64
 9   footlgth  103 non-null    float64
 10  earconch  104 non-null    float64
 11  eye       104 non-null    float64
 12  chest     104 non-null    float64
 13  belly     104 non-null    float64
dtypes: float64(10), int64(2), object(2)
memory usage: 11.5+ KB


Data Cleaning and Preparation

We start by dropping the columns we do not need for training: 'case', 'site', 'Pop', and 'sex'.

In [5]:
drop_columns = ['case', 'site', 'Pop', 'sex']
possum = possum.drop(columns=drop_columns)  # 'columns=' already implies axis=1
possum.head()
possum.info()

Unnamed: 0,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
2,6.0,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
3,6.0,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
4,2.0,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0


Two columns, 'age' and 'footlgth', have missing values. For now, we will drop the rows that contain them.

In [23]:
# Drop every row that is missing 'age' or 'footlgth'
possum = possum.dropna(subset=['age', 'footlgth'])
possum.info()
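
Before dropping rows, it can help to see how many values are missing in each column. The sketch below shows that pattern on a small made-up frame (the values are illustrative and not taken from possum.csv): count missing entries with `isna().sum()`, then drop the affected rows with `dropna(subset=...)`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from possum.csv.
toy = pd.DataFrame({
    "age": [8.0, np.nan, 6.0, 2.0],
    "footlgth": [74.5, 72.5, np.nan, 71.0],
    "chest": [28.0, 28.5, 30.0, 28.5],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where both 'age' and 'footlgth' are present.
cleaned = toy.dropna(subset=["age", "footlgth"])
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Running the same `isna().sum()` call on `possum` should agree with the non-null counts reported by `possum.info()`.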

Convert all datatypes of columns to float (if needed):

In [25]:
# Cast every column to float in one call
possum = possum.astype(float)

possum.head()

Unnamed: 0,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
2,6.0,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
3,6.0,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
4,2.0,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0


Train-Test Split

We want to predict age from the remaining features. X holds the input features and y holds the target (age).

In [26]:
features = possum.drop(['age'], axis=1).columns
X = possum[features]
y = possum['age']

We use sklearn to split X and y into training and testing sets, with 80% of the data used for training.

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    test_size=0.20,
                                                    random_state=2) # set a random seed - do not modify

print("Training size:", "Rows:", X_train.shape[0], ", Columns:", X_train.shape[1])
print("Testing size:", "Rows:", X_test.shape[0], ", Columns:", X_test.shape[1])


Training size: Rows: 80 , Columns: 9
Testing size: Rows: 21 , Columns: 9
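
The 80 and 21 rows add up to 101, which is the 104 original rows minus the 3 rows dropped for missing values. A rough sketch of where the two numbers come from is below. It assumes scikit-learn rounds the test count up and the training count down when both sizes are given as fractions, which matches the output above but is stated here as an assumption rather than a specification.

```python
import math

n_rows = 101  # 104 original rows minus the 3 rows with missing values

# Assumption: test count is rounded up, training count is rounded down.
n_test = math.ceil(0.20 * n_rows)
n_train = math.floor(0.80 * n_rows)

print("train:", n_train, "test:", n_test, "total:", n_train + n_test)
```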


Create a Linear Regression Baseline

The baseline gives us a reference point. If the neural network later performs no better than a basic linear regression, there is no need for the added complexity.

In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

linear_model_test_predictions = linear_model.predict(X_test)
test_mse = mean_squared_error(y_test, linear_model_test_predictions)
print("Linear Regression - Test Set MSE:", test_mse)

Linear Regression - Test Set MSE: 4.48846949659521


The mean squared error is around 4.49, measured in squared years. Taking the square root gives a root mean squared error (RMSE) of about 2.12, so one way to interpret this is that the linear regression's predictions are typically off by roughly 2.12 years.
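
The square-root step can be checked directly. This short sketch copies the test MSE printed above into a variable and converts it to an RMSE:

```python
import math

test_mse = 4.48846949659521  # value printed by the linear regression cell

# RMSE is in the same unit as the target (years), unlike MSE.
test_rmse = math.sqrt(test_mse)
print(f"Linear Regression - Test Set RMSE: {test_rmse:.2f} years")
```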

Now, we will train a Neural Network using PyTorch.

In [29]:
import torch
from torch import nn
from torch import optim

First we should convert the test and train datasets into tensors.

In [30]:
# Convert training set
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float).view(-1,1)

# Convert testing set
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float).view(-1,1)
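
A likely reason for the `.view(-1, 1)` calls is that the model outputs one column per sample, shape (N, 1), while `y_train.values` is a flat array of shape (N,). Subtracting mismatched shapes broadcasts to an N x N matrix instead of N differences. The sketch below illustrates this with NumPy as a stand-in, since it follows the same broadcasting rules; the numbers are made up.

```python
import numpy as np

predictions = np.array([[5.0], [3.0], [4.0]])  # shape (3, 1), like the model output
targets_flat = np.array([5.0, 3.0, 4.0])       # shape (3,), like y_train.values

# Mismatched shapes broadcast to a 3x3 matrix of pairwise differences.
wrong_diff = predictions - targets_flat
print(wrong_diff.shape)

# Reshaping the targets to a column gives one difference per sample.
targets_column = targets_flat.reshape(-1, 1)
right_diff = predictions - targets_column
print(right_diff.shape)

# The predictions match the targets exactly, so only the second MSE is zero.
print((wrong_diff ** 2).mean(), (right_diff ** 2).mean())
```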

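Now we will create the neural network. It takes the 9 input features, passes them through two hidden layers with 90 and 9 units (each followed by a ReLU), and outputs a single value, the predicted age: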

In [51]:
model = nn.Sequential(
    nn.Linear(9, 90),
    nn.ReLU(),
    nn.Linear(90, 9),
    nn.ReLU(),
    nn.Linear(9, 1)
)
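
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below uses only the layer sizes from the definition above and the fact that a linear layer has one weight per input-output pair plus one bias per output:

```python
# Layer sizes copied from the nn.Sequential definition: (inputs, outputs).
layer_shapes = [(9, 90), (90, 9), (9, 1)]

# Each linear layer has inputs * outputs weights plus one bias per output.
params_per_layer = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total_params = sum(params_per_layer)

print("Per layer:", params_per_layer)
print("Total:", total_params)
```

The total is far larger than the 80 training rows, so the training loss alone is probably not a reliable guide to how well the model generalizes.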

Create the loss function and optimizer:

In [52]:
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Create a training loop:

In [53]:
num_epochs = 15000 # number of training iterations
for epoch in range(num_epochs):
    outputs = model(X_train_tensor) # forward pass 
    mse = loss(outputs, y_train_tensor) # calculate the loss 
    mse.backward() # backward pass
    optimizer.step() # update the weights and biases
    optimizer.zero_grad() # reset the gradients to zero

    # keep track of the loss during training
    if (epoch + 1) % 500 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], MSE Loss: {mse.item()}')

Epoch [500/15000], MSE Loss: 2.8215651512145996
Epoch [1000/15000], MSE Loss: 2.545774459838867
Epoch [1500/15000], MSE Loss: 2.3453962802886963
Epoch [2000/15000], MSE Loss: 2.2339372634887695
Epoch [2500/15000], MSE Loss: 2.0977540016174316
Epoch [3000/15000], MSE Loss: 1.9595229625701904
Epoch [3500/15000], MSE Loss: 1.9101145267486572
Epoch [4000/15000], MSE Loss: 1.8657677173614502
Epoch [4500/15000], MSE Loss: 1.7469756603240967
Epoch [5000/15000], MSE Loss: 1.7261836528778076
Epoch [5500/15000], MSE Loss: 1.6718543767929077
Epoch [6000/15000], MSE Loss: 1.6027047634124756
Epoch [6500/15000], MSE Loss: 1.6010640859603882
Epoch [7000/15000], MSE Loss: 1.5981775522232056
Epoch [7500/15000], MSE Loss: 1.5461862087249756
Epoch [8000/15000], MSE Loss: 1.4930334091186523
Epoch [8500/15000], MSE Loss: 1.4689452648162842
Epoch [9000/15000], MSE Loss: 1.4546693563461304
Epoch [9500/15000], MSE Loss: 1.4777178764343262
Epoch [10000/15000], MSE Loss: 1.4476120471954346
Epoch [10500/15000], 

Let's evaluate our neural network on the test set:

In [54]:
model.eval() # set the model to evaluation mode
with torch.no_grad(): # disable gradient calculations
    predictions = model(X_test_tensor) # generate possum age predictions
    test_loss = loss(predictions, y_test_tensor) # calculate testing set MSE loss
    
print('Neural Network - Test Set MSE:', test_loss.item()) # print testing set MSE

Neural Network - Test Set MSE: 3.499732494354248


We reduced the test MSE to about 3.50, roughly a 22% improvement over the linear regression baseline of 4.49. On this split, the nonlinearity introduced by the neural network appears to have helped us out.
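
The relative improvement can be computed from the two test MSE values printed earlier. This is a minimal sketch with those values copied in by hand:

```python
baseline_mse = 4.48846949659521  # linear regression, test set
network_mse = 3.499732494354248  # neural network, test set

# Fraction of the baseline error removed by the neural network.
relative_improvement = (baseline_mse - network_mse) / baseline_mse
print(f"Relative improvement in test MSE: {relative_improvement:.1%}")
```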