# Getting data

We will use data from a Kaggle housing prices set:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Kaggle provides a training set and a test set. The test set does not include sale prices, so we cannot score against it locally; instead, we will split the training set into training and validation subsets.

Those files are at:

* `./data/original/train.csv`
* `./data/original/test.csv`

In [72]:
# This boilerplate allows me to import from my own Python modules
# Taken from https://stackoverflow.com/a/39311677
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)
    
import numpy as np
import torch
import math
import lib

In [3]:
npdata = np.genfromtxt(
    open('./data/original/train.csv'),
    delimiter=',',
    dtype='unicode'
)
npdata

array([['Id', 'MSSubClass', 'MSZoning', ..., 'SaleType', 'SaleCondition',
        'SalePrice'],
       ['1', '60', 'RL', ..., 'WD', 'Normal', '208500'],
       ['2', '20', 'RL', ..., 'WD', 'Normal', '181500'],
       ...,
       ['1458', '70', 'RL', ..., 'WD', 'Normal', '266500'],
       ['1459', '20', 'RL', ..., 'WD', 'Normal', '142125'],
       ['1460', '20', 'RL', ..., 'WD', 'Normal', '147500']], dtype='<U13')

In [4]:
# Inspect headers
npdata[0]

array(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
       'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
       'GarageArea', 'GarageQual', 'GarageCond', 'Pav

Let's start by building a small model, just to make sure we can get things working. We will pull out just four columns.

In [5]:
npdata[0, [4, 38, 46, 62]]

array(['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea'], dtype='<U13')

In [6]:
# The column we are trying to predict
npdata[0, [80]]

array(['SalePrice'], dtype='<U13')
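
The numeric indices above are easy to get wrong if the column order ever changes. As a minimal sketch, the same selection could be done by column name. The `column_indices` helper and the tiny `sample` array are made up for illustration; they are not part of the notebook or of `lib`.

```python
import numpy as np

# Tiny stand-in for the real CSV: a header row plus one data row.
# (Illustrative data only; the real file has many more columns.)
sample = np.array([
    ['Id', 'LotArea', 'GrLivArea', 'SalePrice'],
    ['1', '8450', '1710', '208500'],
])

def column_indices(header, names):
    """Return the position of each requested name in the header row."""
    lookup = {name: i for i, name in enumerate(header)}
    return [lookup[name] for name in names]

feature_cols = column_indices(sample[0], ['LotArea', 'GrLivArea'])
target_cols = column_indices(sample[0], ['SalePrice'])

# Skip the header row and convert the selected columns to floats
inputs = sample[1:, feature_cols].astype(np.float32)
outputs = sample[1:, target_cols].astype(np.float32)
```

With the real data, `sample[0]` would be `npdata[0]`, and the names would be the four features chosen above.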

In [8]:
class HouseDataset(torch.utils.data.Dataset):
    def __init__(self):
        npdata = np.genfromtxt(
            open('./data/original/train.csv'),
            delimiter=',',
            dtype='unicode'
        )
        np_inputs = npdata[1:, [4, 38, 46, 62]].astype(np.float32)
        np_outputs = npdata[1:, [80]].astype(np.float32)
        
        self.inputs = torch.from_numpy(np_inputs)
        self.outputs = torch.from_numpy(np_outputs)
        
        
    def __len__(self):
        return len(self.inputs)
        
    def __getitem__(self, index):
        return (self.inputs[index], self.outputs[index])
        
    

In [9]:
dataset = HouseDataset()

In [10]:
dataset[0]

(tensor([8450.,  856., 1710.,  548.]), tensor([208500.]))

# Split the dataset

We want 80% of the dataset to train on, and will leave 20% to validate with.

In [16]:
train_size = math.floor(len(dataset) * 0.8)
train_size

1168

In [17]:
val_size = len(dataset) - train_size
val_size

292

In [18]:
train_data, val_data = torch.utils.data.random_split(
  dataset,
  [train_size, val_size] # Any number of splits works, as long as the sizes sum to len(dataset)
)
train_data

<torch.utils.data.dataset.Subset at 0x12480b668>

In [19]:
print(len(train_data), len(val_data))

1168 292
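
`random_split` picks different rows on every run unless it is given a seeded generator. The sketch below shows a repeatable split on a synthetic dataset of the same length; the `toy` dataset and the seed value are assumptions for illustration.

```python
import math

import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in with the same number of rows as the housing data
toy = TensorDataset(torch.arange(1460).float().unsqueeze(1))

train_size = math.floor(len(toy) * 0.8)
val_size = len(toy) - train_size

# Two splits made with identically seeded generators select the same rows
split_a = random_split(
    toy, [train_size, val_size], generator=torch.Generator().manual_seed(0)
)
split_b = random_split(
    toy, [train_size, val_size], generator=torch.Generator().manual_seed(0)
)

same_rows = list(split_a[0].indices) == list(split_b[0].indices)
```

A fixed seed makes it possible to compare training runs against the same validation rows.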


We now need to wrap these two splits in "data loaders." The data loaders handle mini-batching.

In [21]:
# With only 4 features, we could use a batch size in the thousands on a decent machine.
# However, we really want to see the batching happen.
# Shuffle so that any preset ordering in the data does not affect training.
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
train_loader

<torch.utils.data.dataloader.DataLoader at 0x1243314e0>

In [22]:
# Order does not matter when we only measure error, so no shuffling is needed here
val_loader = torch.utils.data.DataLoader(val_data, batch_size=128, shuffle=False)
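
To see what the mini-batching looks like, here is a small sketch that iterates over a loader built from placeholder tensors with the same shape as our training split (1168 rows, 4 features). The zero-filled tensors are stand-ins, not the real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors shaped like the training split
features = torch.zeros(1168, 4)
targets = torch.zeros(1168, 1)

toy_loader = DataLoader(
    TensorDataset(features, targets), batch_size=128, shuffle=True
)

# Each iteration yields one (inputs, outputs) mini-batch
batch_sizes = [len(inputs) for inputs, _ in toy_loader]
```

Since 1168 is not a multiple of 128, the last batch is smaller than the rest (nine batches of 128, then one of 16). This is why the tracking code later weights each batch's error by `len(inputs)`.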

In [71]:
net = lib.network.LinearNet([4, 8, 16, 8, 1])
net

LinearNet(
  (layer1): Linear(in_features=4, out_features=8, bias=True)
  (layer2): Linear(in_features=8, out_features=16, bias=True)
  (layer3): Linear(in_features=16, out_features=8, bias=True)
  (layer4): Linear(in_features=8, out_features=1, bias=True)
)
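
The source of `lib.network.LinearNet` is not shown here. As a rough, hypothetical sketch only, a class that produces a similar layer listing could look like the following. The name `SketchLinearNet` and the ReLU activations between layers are assumptions; the real class may differ.

```python
import torch
from torch import nn

class SketchLinearNet(nn.Module):
    """Hypothetical stand-in for lib.network.LinearNet (real source not shown)."""

    def __init__(self, sizes):
        super().__init__()
        self.num_layers = len(sizes) - 1
        # Register layer1, layer2, ... so they appear by name in the module repr
        for i in range(self.num_layers):
            setattr(self, f"layer{i + 1}", nn.Linear(sizes[i], sizes[i + 1]))

    def forward(self, x):
        for i in range(1, self.num_layers + 1):
            x = getattr(self, f"layer{i}")(x)
            # Assumption: ReLU between layers, no activation after the last one
            if i < self.num_layers:
                x = torch.relu(x)
        return x

sketch = SketchLinearNet([4, 8, 16, 8, 1])
out = sketch(torch.zeros(3, 4))  # a batch of 3 rows with 4 features each
```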

In [60]:
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

We need an outer loop over epochs and, now, an inner loop over the mini-batches within each epoch. The data loader makes this easy.

In [65]:
total_pct_error = 0
total_count = 0


epochs = 25
for i in range(epochs):
    for j, data in enumerate(train_loader, 0):
        inputs = data[0]
        outputs = data[1]
        
        net.zero_grad()
        net_outputs = net(inputs)
        
        # We insert our own tracking code here separate from the learning code
        # to make it easier to track model performance over time.
        diff = (net_outputs - outputs).abs()
        pct_error = (diff / outputs * 100).mean().item()
        total_pct_error += (pct_error * len(inputs))
        total_count += len(inputs)
        
        
        loss = criterion(net_outputs, outputs)
        loss.backward()
        optimizer.step()
        
print(f"{round(total_pct_error / total_count, 2)}% error.")

21.44% error.
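
The tracking code multiplies each batch's mean percent error by the batch size before summing. The sketch below, using made-up numbers and a hypothetical `batch_pct_error` helper, illustrates why: with uneven batches, this weighting gives the same answer as computing the error over all rows at once.

```python
import torch

def batch_pct_error(predictions, targets):
    """Mean absolute percentage error for one batch, as a Python float."""
    return ((predictions - targets).abs() / targets * 100).mean().item()

# Made-up values, split into uneven batches like a short final mini-batch
targets = torch.tensor([[100.0], [200.0], [400.0]])
predictions = torch.tensor([[110.0], [180.0], [400.0]])
batches = [(predictions[:2], targets[:2]), (predictions[2:], targets[2:])]

total_pct_error = 0.0
total_count = 0
for preds, targs in batches:
    # Weight each batch's mean by its size so every row counts equally
    total_pct_error += batch_pct_error(preds, targs) * len(preds)
    total_count += len(preds)

weighted = total_pct_error / total_count
overall = batch_pct_error(predictions, targets)
```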


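Great, we just trained our model for 25 epochs across all of our training data! Note that the reported error is averaged over every mini-batch in all 25 epochs, so it includes the early, poorly fitted predictions. Now let's devise a way to validate the result.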

# Validation

Now let's iterate over the validation set and compare its percent error against the training set's.

In [66]:
total_pct_error = 0
total_count = 0

# Validation must not update the model: no backward pass and no optimizer step,
# otherwise we would be training on the validation set.
with torch.no_grad():
    for inputs, outputs in val_loader:
        net_outputs = net(inputs)

        # Mean absolute percent error for this batch, weighted by batch size
        diff = (net_outputs - outputs).abs()
        pct_error = (diff / outputs * 100).mean().item()
        total_pct_error += (pct_error * len(inputs))
        total_count += len(inputs)

print(f"{round(total_pct_error / total_count, 2)}% error.")

20.38% error.


# Saving a model


In [68]:
torch.save({
    'model': net.state_dict(),
    'optimizer': optimizer.state_dict() 
}, './models/house_prices.pt')
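
A saved checkpoint is only useful if it can be restored. As a minimal sketch, the round trip could look like this, using a small `torch.nn.Linear` model and a temporary file as stand-ins for `net` and `./models/house_prices.pt`. The dictionary keys match the ones used above.

```python
import os
import tempfile

import torch

# Stand-ins for the trained network and its optimizer
model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=0.001)

path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
torch.save({'model': model.state_dict(), 'optimizer': opt.state_dict()}, path)

# To restore, build objects of the same shape first, then load the saved state
restored = torch.nn.Linear(4, 1)
restored_opt = torch.optim.Adam(restored.parameters(), lr=0.001)

checkpoint = torch.load(path)
restored.load_state_dict(checkpoint['model'])
restored_opt.load_state_dict(checkpoint['optimizer'])
```

Saving the optimizer state alongside the model matters mainly when training will be resumed; for inference alone, the model's state dict is enough.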