Adapted from my own notebook: https://www.kaggle.com/fanbyprinciple/beginner-pytorch-notebook-on-housing-dataset
I felt I had forgotten some of this material. The notebook was originally part of the D2L book, https://d2l.ai/, which I highly recommend. The explanations are taken from that book.

# LET'S LOAD THE DATA

We use pandas to load the two csv files containing training and test data respectively.

In [231]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

In [232]:
train_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

The training dataset includes 1460 examples, 80 features, and 1 label, while the test data contains 1459 examples and 80 features.

In [233]:
train_data.head()

In [234]:
train_data.shape

Here is how to look at the first four rows for a handful of columns:

In [235]:
train_data.iloc[0:4, [1,2,3,-1,-2]]

The training data has one more column than the test data: the label.

In [236]:
len(test_data.columns), len(train_data.columns)

# DATA PREPROCESSING

Since both the training and the test data go through the same preprocessing, we first concatenate them. In each example, the first feature is the ID. This helps identify each training example, but it carries no information for prediction purposes. Hence, we remove it from the dataset before feeding the data into the model. The label column (the last column of the training data) is left out of the features as well.

In [237]:
all_features = pd.concat((train_data.iloc[:,1:-1], test_data.iloc[:,1:]))

`pd.concat` stacks the two frames row-wise by default.

In [238]:
all_features.head()

### 1. NORMALISATION

Now let's look for the numerical features, since we want to normalise those first.

In [239]:
all_features.dtypes

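We can use `==` on the `dtypes` Series for an elementwise comparison. Here is how to find the `int64` columns; the cell after that keeps every column whose dtype is not `object`, which gives all the numerical features.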

In [240]:
all_features.dtypes[all_features.dtypes == 'int64']

In [241]:
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features]

In [242]:
# numeric_features = all_features.dtypes[all_features.dtypes=='int64'].index

Notice the `.index`: it returns only the column names.

In [243]:
def normalise(x):
    return ((x - x.mean())/x.std())

Applying normalisation.

As stated above, we have a wide variety of data types, so we need to preprocess the data before we can start modeling. Let us start with the numerical features. First, we apply a heuristic, replacing all missing values by the corresponding feature's mean. Then, to put all features on a common scale, we standardize the data by rescaling features to zero mean and unit variance. (In the cells below the order is swapped: we standardize first and then fill the missing values with 0, which is the feature's mean after standardization.)

$$x \leftarrow \frac{x - \mu}{\sigma},$$
where $\mu$ and $\sigma$ denote mean and standard deviation, respectively. To verify that this indeed transforms our feature (variable) such that it has zero mean and unit variance, note that $E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0$ and that $E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2+\mu^2 = \sigma^2$. Intuitively, we standardize the data for two reasons. First, it proves convenient for optimization. Second, because we do not know a priori which features will be relevant, we do not want to penalize coefficients assigned to one feature more than on any other.
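
As a quick sanity check of the formula above, here is a minimal sketch on a made-up column (the values are invented for illustration). Note that pandas' `std()` uses the sample standard deviation by default, which is also what the notebook's cells use.

```python
import pandas as pd

# Toy column with made-up values (not taken from the housing data)
x = pd.Series([50.0, 60.0, 80.0, 110.0])

# Standardize: subtract the mean, then divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # approximately 0
print(z.std())   # approximately 1
```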

In [244]:
# If test data were inaccessible, mean and standard deviation could be
# calculated from training data

all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
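
If the test set really were unavailable at preprocessing time, as the comment in the cell above mentions, the statistics would have to come from the training rows only. A minimal sketch of that variant, using toy frames (`toy_train`, `toy_test` and the `LotArea` values are made up for illustration):

```python
import pandas as pd

toy_train = pd.DataFrame({"LotArea": [8000.0, 9500.0, 11000.0]})
toy_test = pd.DataFrame({"LotArea": [7000.0, 12500.0]})

# Statistics are computed from the training rows only
mu = toy_train.mean()
sigma = toy_train.std()

# The same shift and scale are then applied to both splits
toy_train_std = (toy_train - mu) / sigma
toy_test_std = (toy_test - mu) / sigma
print(toy_test_std)
```

With this approach the test columns are no longer guaranteed to have exactly zero mean and unit variance, which is expected.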


Notice the difference in `all_features[numeric_features]` before and after normalisation.

In [245]:
all_features[numeric_features]


### HANDLING MISSING ENTRIES

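One of the first jobs when handling a dataset is dealing with the missing values. Since the numerical features now have zero mean, filling with 0 replaces each missing value with its feature's mean.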

In [246]:
all_features[numeric_features] = all_features[numeric_features].fillna(0)
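
A small sketch of this step on a made-up column, showing why 0 is a reasonable fill value after standardization:

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are made up)
col = pd.Series([1.0, 2.0, np.nan, 3.0])

# mean() and std() skip NaN by default, so the missing entry stays NaN here
standardized = (col - col.mean()) / col.std()

# After standardization the mean is 0, so 0 stands in for "the average value"
filled = standardized.fillna(0)
print(filled.tolist())  # [-1.0, 0.0, 0.0, 1.0]
```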

### ONE HOT ENCODING

Discrete features now need to be one-hot encoded. Each discrete column is split into one indicator column per distinct value, holding 1 where the row has that value and 0 otherwise. For example, if `SaleType` had two discrete values, other and WD, two columns would be created: `SaleType_other` and `SaleType_WD`. With `dummy_na=True`, an extra indicator column is added for missing values.

In [247]:
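# dtype=float keeps the indicator columns numeric; recent pandas versions
# return booleans by default, which breaks the tensor conversion later on
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)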

all_features
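
A small sketch on a made-up column shows what `get_dummies` produces, including the extra column that `dummy_na=True` adds for missing values. Here `dtype=int` is passed only so that the output prints as 0/1 rather than booleans:

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical column and one missing entry (made-up rows)
toy = pd.DataFrame({"SaleType": ["WD", "Other", np.nan, "WD"]})

# dummy_na=True adds an indicator column for the missing values
encoded = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(encoded)
```

Every row ends up with exactly one 1, either in a value column or in the missing-value column.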


Notice that there are now 331 columns instead of the earlier 79. Data preprocessing is now over.

### BIFURCATING TEST AND TRAIN DATA

We now split the data back into training and test sets and convert them into float32 torch tensors. Via the `values` attribute, we can extract the NumPy format from the pandas format and convert it into the tensor representation for training.

In [248]:
n_train = len(train_data)

train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)

train_labels = torch.tensor(train_data.iloc[:,-1].values, dtype=torch.float32)
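
A minimal sketch of the same conversion on a tiny made-up frame:

```python
import pandas as pd
import torch

toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# .values gives a NumPy array; torch.tensor copies it into a float32 tensor
t = torch.tensor(toy.values, dtype=torch.float32)
print(t.shape, t.dtype)  # torch.Size([2, 2]) torch.float32
```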


Since this is just the beginning, we will go for a proof of concept and train a simple sequential model.
Later we will change the definition of `get_net()`. To get started we train a linear model with squared loss. Not surprisingly, our linear model will not lead to a competition-winning submission but it provides a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there might be a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline giving us some intuition about how close the simple model gets to the best reported models, giving us a sense of how much gain we should expect from fancier models.

In [249]:
loss = nn.MSELoss()

in_features = train_features.shape[1]
out_features = 1

def get_net():
#     net = nn.Sequential(nn.Linear(in_features,256), nn.ReLU(), nn.Linear(256,out_features))
    net = nn.Sequential(nn.Linear(in_features, out_features))
    return net

With house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus we tend to care more about the relative error $\frac{y - \hat{y}}{y}$ than about the absolute error $y - \hat{y}$. For instance, if our prediction is off by USD 100,000 when estimating the price of a house in rural Ohio, where the value of a typical house is 125,000 USD, then we are probably doing a horrible job. On the other hand, if we err by this amount in Los Altos Hills, California, this might represent a stunningly accurate prediction (there, the median house price exceeds 4 million USD).

One way to address this problem is to measure the discrepancy in the logarithm of the price estimates. In fact, this is also the official error measure used by the competition to evaluate the quality of submissions. After all, a small value $\delta$ for $|\log y - \log \hat{y}| \leq \delta$ translates into $e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta$. This leads to the following root-mean-squared-error between the logarithm of the predicted price and the logarithm of the label price:

$$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2}.$$
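
To make the Ohio versus Los Altos Hills comparison concrete, here is a small sketch using the prices from the text (the helper `abs_log_error` is introduced only for this illustration):

```python
import math

def abs_log_error(y, y_hat):
    """Absolute difference between log(y) and log(y_hat)."""
    return abs(math.log(y) - math.log(y_hat))

# The same USD 100,000 overestimate on two very different houses
cheap = abs_log_error(125_000, 225_000)
expensive = abs_log_error(4_000_000, 4_100_000)

print(f"{cheap:.3f}")      # 0.588
print(f"{expensive:.3f}")  # 0.025
```

The same absolute error costs far more on the cheaper house, which is the behaviour we want from the metric.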

### LOG RMSE MEASURE

In [250]:
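def log_rmse(net, features, labels):
    dev = next(net.parameters()).device  # evaluate on the network's device
    with torch.no_grad():
        # Clamp predictions to >= 1 (finite log); labels reshaped to (n, 1) to avoid broadcasting
        preds = torch.clamp(net(features.to(dev)), 1, float('inf'))
        labels = labels.to(dev).reshape(-1, 1)
        rmse = torch.sqrt(loss(torch.log(preds), torch.log(labels)))
    return rmse.item()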
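
One thing worth checking with `nn.MSELoss` is that predictions and labels have the same shape. A network with a single output returns shape `(n, 1)`; if the labels are 1-D with shape `(n,)`, PyTorch broadcasts the pair to `(n, n)` and averages over every prediction-label combination. A small sketch with made-up numbers:

```python
import warnings

import torch
import torch.nn as nn

mse = nn.MSELoss()
preds = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like a network output
labels = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # PyTorch warns about the size mismatch
    # (3, 1) against (3,) broadcasts to (3, 3): every prediction vs every label
    mismatched = mse(preds, labels).item()

# Reshaping the labels to (3, 1) compares element by element
matched = mse(preds, labels.reshape(-1, 1)).item()

print(mismatched, matched)
```

Although the predictions equal the labels here, only the reshaped version reports a loss of zero.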

We will create a helper function to load the dataset, and convert it into a data loader.

In [251]:
def load_array(data_array, batch_size):
    in_dataset = torch.utils.data.TensorDataset(*data_array)
    in_dataloader = torch.utils.data.DataLoader(in_dataset, shuffle=True, batch_size=batch_size)
    return in_dataloader
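
To see what such a data loader yields, here is a minimal sketch on made-up tensors. It builds the loader directly rather than through `load_array`, so that it runs on its own:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten made-up examples with three features each, plus one label per example
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.arange(10, dtype=torch.float32)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

# 10 examples with batch_size=4 gives batches of 4, 4 and 2
batch_sizes = [len(yb) for _, yb in loader]
print(batch_sizes)  # [4, 4, 2]
```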

Helper function for initialisation of weights.

In [252]:
def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)

Accuracy is not meaningful for regression, because you cannot expect the model to predict a house price that exactly equals the true value. Still, I put it here out of habit.

In [253]:
def accuracy(y_hat, y):
    return (torch.argmax(y_hat, dim=1)==y).sum().float().mean()
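
To see concretely why this accuracy is always zero for a single-output regression network, here is a small sketch with made-up prices:

```python
import torch

# Output of a regression network: one column, so shape (n, 1)
y_hat = torch.tensor([[208000.0], [181000.0], [223000.0]])
y = torch.tensor([208500.0, 181500.0, 223500.0])

# argmax over a single column can only ever return index 0
print(torch.argmax(y_hat, dim=1))  # tensor([0, 0, 0])

# Index 0 never equals a house price, so no "match" is ever counted
matches = (torch.argmax(y_hat, dim=1) == y).sum().item()
print(matches)  # 0
```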

We need to put both the model and its inputs onto the GPU for faster computation. The cell below picks `torch.device('cuda')` when a GPU is available and falls back to the CPU otherwise.

In [254]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

As simple as they come.

In [255]:
net = get_net()
net

### HYPERPARAMETERS

In [256]:
test_labels = None
batch_size= 64
learning_rate = 0.03
lr = learning_rate
weight_decay = 0
num_epochs = 100

The next cell peeks inside the training dataloader to see what we are actually working with. You can skip it entirely.

In [264]:
train_dataloader = load_array((train_features, train_labels), batch_size)
with torch.no_grad():
    for X, y in train_dataloader:
        print(X.shape)
        print(X, "\n")
        print(len(y))
        print(y, "\n")
        y_hat = net(X)
        print(y_hat)
        
        print(accuracy(y_hat, y))
        #for x in X:  
        #    y_hat = net(x)
        #    print(x)
        #    print(y_hat)
        #    print(y_hat.shape)
        break

Looks good. Onward.

## TRAINING

In [258]:
def train(net, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay,batch_size):
    train_ls, test_ls = [], []
    
    net = net.to(device)
    net.apply(init_weights)
    
    train_acc = []
    
    
    optimizer = torch.optim.Adam(net.parameters(), lr = learning_rate, weight_decay=weight_decay)
    
    train_dataloader = load_array((train_features, train_labels),batch_size)
    
    for epoch in range(num_epochs):
        curr_acc = 0
        numer = 0
        for X, y in train_dataloader:
            X = X.to(device)
            y = y.to(device)
            
            y_hat = net(X)
            
            l = loss(y_hat, y.unsqueeze(1))
            curr_acc += accuracy(y_hat, y)
            
#             print(y)
#             print(y_hat)
            
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            numer += len(y)
            
        train_ls.append(log_rmse(net, train_features, train_labels))
        train_acc.append(curr_acc/ numer)
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
        print(f'for epoch {epoch} rmse: {train_ls[-1]}, accuracy : {train_acc[-1]}')
    
    return train_ls, test_ls
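
To check that a loop of this shape (forward pass, `zero_grad`, `backward`, `step`) actually learns, it can help to run it on synthetic data where the answer is known. A minimal sketch, assuming a made-up linear relationship and full-batch updates rather than the mini-batches used above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: y = 2*x1 - 3*x2 + 1 plus a little noise (made up for this check)
X = torch.randn(200, 2)
y = X @ torch.tensor([2.0, -3.0]) + 1.0 + 0.01 * torch.randn(200)

model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    l = criterion(model(X), y.unsqueeze(1))  # labels reshaped to (n, 1)
    l.backward()
    optimizer.step()
    losses.append(l.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the loss did not go down on data this simple, the loop itself would be the first place to look for a bug.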
            

In [259]:
net(train_features).detach()

In [260]:
#train_ls, test_ls =  train(net, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay,batch_size)

### PLOTTING

I'll be using the plotting helper from D2L for the visualisation, so let's install the package.

In [261]:
!pip install -U d2l
# Import the PyTorch flavour of d2l under the name d2l; the name torch stays untouched
from d2l import torch as d2l

### Prediction

Technically, this function does both training and prediction, since we will be calling it again and again.

In [262]:
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.plot(list(range(1, num_epochs + 1)), [train_ls], xlabel='epoch',
             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
    print(f'train log rmse {float(train_ls[-1]):f}')
    # Apply the network to the test set: inputs go to the model's device and
    # predictions come back to the CPU before the NumPy conversion
    preds = net(test_features.to(device)).detach().cpu().numpy()
    # Reformat it to export to Kaggle
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)

In [263]:
train_and_pred(train_features, test_features, train_labels, test_data,
num_epochs, lr, weight_decay, batch_size)
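
The reshaping in the last lines of `train_and_pred` can be tried in isolation. A minimal sketch with made-up ids and predictions (`toy_test` and the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up ids and predictions; a network output has shape (n, 1)
toy_test = pd.DataFrame({"Id": [1461, 1462, 1463]})
preds = np.array([[120000.0], [155000.0], [183000.0]])

# reshape(1, -1)[0] flattens (n, 1) into a 1-D array of length n
toy_test["SalePrice"] = pd.Series(preds.reshape(1, -1)[0])
submission = pd.concat([toy_test["Id"], toy_test["SalePrice"]], axis=1)
print(submission)
```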


An rmse of 4.08202 takes us to around rank 4k on the leaderboard. Let's try a different `get_net` function. Notice that the accuracy is 0.0, because the predicted `y_hat` is never exactly equal to the target value `y`.

# IMPROVING OUR POSITION ON LEADERBOARD

An implementation of a dense (fully connected) network, as mentioned in this paper:

https://arxiv.org/pdf/2108.00864.pdf

In [266]:
def get_net():
    net = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(), nn.Linear(256,out_features))
    return net

net = get_net()
net
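
For a rough sense of this model's size, here is a sketch that counts its trainable parameters, assuming 331 input columns (the count reported after one-hot encoding; it may differ with other pandas versions):

```python
import torch.nn as nn

in_features, out_features = 331, 1  # 331 is an assumption taken from the text

net = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(), nn.Linear(256, out_features))

# Each Linear layer has (inputs * outputs) weights plus one bias per output
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 85249
```

That is far more parameters than the 1460 training examples, which is worth keeping in mind when reading the training log rmse.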

In [267]:
train_and_pred(train_features, test_features, train_labels, test_data,
num_epochs, lr, weight_decay, batch_size)

Now this takes you within earshot of 1k on the leaderboard. Our best score yet!

In [270]:
# Would creating a more complex model help?

def get_net():
    net = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(),nn.Linear(256,64), nn.ReLU(), nn.Linear(64,out_features))
    return net

net = get_net()
net

In [271]:
train_and_pred(train_features, test_features, train_labels, test_data,
num_epochs, lr, weight_decay, batch_size)

Not really, it didn't quite help. The score is worse than before. Hmm.

This gives me an idea (see https://towardsdatascience.com/pitfalls-with-dropout-and-batchnorm-in-regression-problems-39e02ce08e4d): can we use VGG models with a head for linear regression?