# Graded Case 1 (Part I): House price prediction

In this case (Part I), you will build a multilayer perceptron network to predict the selling price of properties. The dataset consists of all single-family houses and condos that were sold in Denver in a given year.

You need to submit the following files:

- A PDF or Word document containing the plots of the training errors for the multi-layer perceptron model and the linear regression model, and the answers to the two questions below.

- The complete Jupyter notebook containing all your PyTorch code with explanations, along with Markdown text explaining the different parts if needed.

To get the test error for your model, you need to submit your predicted prices for the test data on Kaggle. Note that in Part I of the case, you do not need to worry about optimizing your model to get the lowest error possible. Part I will be graded based on your implementation of the base model as specified below. We will come back to optimizing the model and to the Kaggle competition in Part II of the case.

---
## Data Loading and Visualization

The train data and test data are available on the Kaggle website.
You can first download them, then upload them to Google Colab, and then read the data using pandas.

In [None]:
import pandas as pd  # Importing pandas, which is a library for data manipulation and analysis
#TODO: Read the datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

Let's take a quick look at the data.

In [None]:
# Display the train dataframe
print(train_df.shape)
print(train_df.columns)

In [None]:
# Display the test dataframe
print(test_df.shape)
print(test_df.columns)

As you can see, we have 11581 training samples and 4964 test samples, each with 16 features. The training samples contain the sale_prices, which are the labels. The test samples do not contain the sale_prices, which we will predict by building an MLP model.

### Visualization of SALE PRICES in train data

Let's take a closer look at the sale prices in the train data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  # Importing matplotlib's pyplot, it provides a MATLAB-like interface for making plots and charts

# Set the style
sns.set(style="whitegrid")

# Create a histogram
plt.figure(figsize=(10, 6))
sns.histplot(train_df['SALE_PRICE'], bins=50, color='blue')
plt.title('Histogram of Sale Prices (Train Data)')
plt.xlabel('Sale Price')
plt.ylabel('Number of Properties')
plt.show()

In [None]:
print(train_df['SALE_PRICE'].min())
print(train_df['SALE_PRICE'].max())
print(train_df['SALE_PRICE'].median())

We see that the sale_price has a wide range, from 50K to 2 million, with a median price of 431K.

---
## Data Preparation

The first step when building a neural network model is getting your data into the proper form to feed into the network.

- **Train labels**: We need to extract the sale prices from the train data as train labels. Since the house prices can take very large values, to make training fast it is helpful to define the train labels as the sale prices divided by a normalization factor.

- **Handling non-numeric features**: Some of the house features are non-numeric. We will learn how to process categorical data in the upcoming lectures. For now, you can remove those non-numeric features and train only on the numeric features.

- **Feature standardization**: The features span a variety of ranges: some take small floating-point values, and others take fairly large integer values. The model might be able to adapt to such heterogeneous data automatically, but it would make learning more difficult. A widespread best practice for such data is feature-wise normalization: for each feature in the input data (a column in the input dataframe), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has unit standard deviation. Note that here we combine the feature vectors in the train and test data, so that both go through the same normalization.

- **Handling missing values**: There may exist some entries with missing values. After the feature standardization, we can impute the missing values with zeros.
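
The standardization and imputation steps described above can be sketched on a toy table. This is only an illustration: the column names `area` and `beds` and their values are made up and are not taken from the Denver dataset.

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical column names) with one missing entry.
toy = pd.DataFrame({
    "area": [1000.0, 1500.0, 2000.0, np.nan],
    "beds": [2.0, 3.0, 4.0, 3.0],
})

# Feature-wise standardization: pandas skips NaN when computing mean and std.
standardized = toy.apply(lambda col: (col - col.mean()) / col.std())

# After standardization each column has mean 0, so filling NaN with 0
# is the same as imputing the column mean in the original units.
imputed = standardized.fillna(0)

print(imputed)
```

Imputing after standardization is a convenient shortcut here because zero is exactly the mean of every standardized column.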

As noted above, the sale_price in the train data ranges from 50K to 2 million, with a median price of 431K. We divide the sale_price by 100K, so the normalized sale_price lies between 0.5 and 20 in the training data. Remember that when we output the predicted prices for the test data, we need to multiply by the normalization factor again.
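
As a quick illustration of the label scaling, the sketch below normalizes three hypothetical prices (chosen to match the minimum, median, and maximum quoted in the text) and then maps them back to dollars.

```python
import pandas as pd

# Hypothetical sale prices in dollars, spanning the range described in the text.
prices = pd.Series([50_000.0, 431_000.0, 2_000_000.0])

normalization_factor = 100_000
labels = prices / normalization_factor       # values the model is trained on
recovered = labels * normalization_factor    # back to dollars for reporting

print(labels.tolist())
print(recovered.tolist())
```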

In [None]:
#TODO: define labels for train data
normalization_factor=100000
train_labels = train_df['SALE_PRICE']/normalization_factor
train_df.drop('SALE_PRICE', axis=1, inplace=True) # drop the sale_prices in features.

With `inplace=True`, `drop` modifies `train_df` directly and returns `None` instead of a new DataFrame. It does not reorder the remaining rows or columns. Because the cell mutates `train_df`, running it a second time raises a `KeyError`, since the `SALE_PRICE` column is no longer there.
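
A small sketch of how `drop` behaves with and without `inplace`, using a toy DataFrame whose values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "SALE_PRICE": [100, 200, 300], "AREA": [10, 20, 30]})

# Without inplace, drop returns a new DataFrame and leaves df unchanged.
without_price = df.drop("SALE_PRICE", axis=1)

# With inplace=True, df itself is modified and the call returns None.
result = df.drop("SALE_PRICE", axis=1, inplace=True)

print(list(without_price.columns))
print(result)
print(list(df.columns))
print(df["ID"].tolist())  # row order is unchanged
```

Assigning the returned DataFrame to a new name (the first form) makes a cell safe to re-run, which can be handy in a notebook.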

In [None]:
train_labels.shape

In [None]:
train_labels

Note that both the training samples and test samples contain an ID column, which is not informative for predicting the house price. Thus we will drop the ID column.

In [None]:
train_ID=train_df['ID']
test_ID=test_df['ID']
train_df.drop('ID', axis=1, inplace=True)
test_df.drop('ID', axis=1, inplace=True)

In [None]:
# Then we combine the feature vectors in the train data and test data
features=pd.concat(objs=[train_df,test_df],axis=0)

In [None]:
numeric_features = features.dtypes[features.dtypes != 'object'].index
non_numeric_features = features.dtypes[features.dtypes == 'object'].index
numeric_features, non_numeric_features

We see that there are three non-numeric features, namely `NBHD`, `PROP_CLASS`, and `STYLE_CN`. One-hot encoding is a common way to feed such features to a model, but for the base model in Part I we simply drop them.
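
For reference, the two options (one-hot encoding versus dropping) might look as follows on a toy frame. The category values `RANCH` and `CONDO` are invented for illustration and may not appear in the real data.

```python
import pandas as pd

# Toy frame; the category values are made up for illustration.
toy = pd.DataFrame({
    "AREA": [1200, 900, 1500],
    "STYLE_CN": ["RANCH", "CONDO", "RANCH"],
})

# Option 1: one-hot encode the non-numeric column.
encoded = pd.get_dummies(toy, columns=["STYLE_CN"], dtype=float)

# Option 2: keep only the numeric columns.
numeric_only = toy.select_dtypes(include="number")

print(encoded.columns.tolist())
print(numeric_only.columns.tolist())
```

One-hot encoding adds one column per category, so a feature with many distinct values (such as a neighborhood code) can widen the input considerably.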

In [None]:
features= features.drop(non_numeric_features, axis=1)

In [None]:
# Standardize numeric features
features[numeric_features] = features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))

In [None]:
# recheck the mean and std after standardization
features[numeric_features].mean(), features[numeric_features].std()

We see that after standardization, each feature has mean 0 and standard deviation 1 over the combined train and test data.

In [None]:
# After the feature standardization, we can impute the missing values with zeros.
features[numeric_features] = features[numeric_features].fillna(0)

Double check the features after data processing.

In [None]:
features.info()
print(features.columns)

Now, we are left with 12 features.

In [None]:
# check whether there is any missing entry
print(features.isnull().sum())

In [None]:
#TODO: Write code to construct feature vectors for train and test data after data preparation.
train_features = features.iloc[:len(train_labels)]
test_features = features.iloc[len(train_labels):]

In [None]:
train_features.shape, test_features.shape

Finally, we convert features and labels to PyTorch tensors.

In [None]:
import torch
import numpy as np

# Convert training features and labels to PyTorch tensors
train_features = torch.tensor(train_features.values.astype(np.float32), dtype=torch.float32)
test_features = torch.tensor(test_features.values.astype(np.float32), dtype=torch.float32)
train_labels = torch.tensor(train_labels.values.reshape(-1, 1).astype(np.float32), dtype=torch.float32)

---
## DataLoaders and Batching

After creating the training and test data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset), which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training and test Tensor datasets. Note that we will shuffle the train data, so the model will not learn a particular order. For test data, we do not shuffle, so that the predictions stay aligned with the test IDs.
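
The two steps can be sketched on random toy tensors. The sizes here (10 samples, 3 features, batch size 4) are arbitrary and chosen only to show how the last batch ends up smaller.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 10 samples with 3 features each, and one label per sample.
x = torch.randn(10, 3)
y = torch.randn(10, 1)

dataset = TensorDataset(x, y)   # pairs x[i] with y[i]
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
for xb_shape, yb_shape in batch_shapes:
    print(xb_shape, yb_shape)   # the last batch is smaller: 10 = 4 + 4 + 2
```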

In [None]:
from torch.utils.data import TensorDataset, DataLoader
#  Create DataLoaders and batch our train data
train_data = TensorDataset(train_features, train_labels)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

In [None]:
#TODO: Create DataLoaders and batch our test data
test_data =TensorDataset(test_features)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False)

Let's take one batch as a sanity check.

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
features, labels = next(dataiter)

print('Sample input size: ', features.size()) # (batch_size, num_features)
print('Sample input: \n', features)
print()
print('Sample label size: ', labels.size()) # (batch_size, 1)
print('Sample label: \n', labels)

---
## Linear Regression as Benchmark

Let us build a linear regression model as a benchmark. Note that linear regression can be viewed as a special case of a multi-layer perceptron with no hidden layer and a single output neuron.
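
To make the "no hidden layer, single output neuron" view concrete, the sketch below checks that a single `nn.Linear` layer computes the usual linear regression formula. The input size of 12 is an assumption that matches the 12 numeric features left after data preparation; the batch is random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_features = 12                     # assumed; the 12 numeric features kept earlier
layer = nn.Linear(num_features, 1)    # one weight per feature plus one bias

x = torch.randn(5, num_features)      # a toy batch of 5 samples
manual = x @ layer.weight.T + layer.bias   # y = x W^T + b, computed by hand

print(torch.allclose(layer(x), manual, atol=1e-6))
print(sum(p.numel() for p in layer.parameters()))
```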

In [None]:
#TODO: Build a linear regression model network
import torch.nn as nn
lin_net = nn.Linear(train_features.shape[1], 1)

Let's print out the model architecture.

In [None]:
lin_net

Let's take a batch and see the output


In [None]:
features, labels = next(dataiter)
output=lin_net(features)
output.shape,labels.shape

## Train the model

First, we will use GPU training if it is available.

In [None]:
#TODO: use GPU for training if it is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
lin_net = lin_net.to(device)

Second, let us specify the loss function.

In [None]:
#TODO: specify the loss function for training
criterion = nn.MSELoss()

We are now ready to train the network.

Note that with house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus we tend to care more about the relative error than about the absolute error. For instance, if our prediction is off by \\$100,000 when estimating the sale price of a house which is \\$125,000, then we are probably doing a horrible job. On the other hand, if we err by this amount for a house with a sale price of \\$2 million, this might represent a pretty accurate prediction.

To this end, we will use the median error rate (MER) used by [Zestimate](https://www.zillow.com/z/zestimate/) to measure the predictive performance. The error rate is defined as
$$
\text{Error Rate} = \left| \frac{\text{Predicted Price}-\text{Actual Price}}{\text{Actual Price}} \right|
$$
The median error rate is defined as the median of error rates for all properties.
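
A minimal sketch of the MER computation on four hypothetical properties (prices in units of \\$100K; the numbers are made up):

```python
import torch

# Hypothetical predicted and actual prices in units of $100K.
predicted = torch.tensor([[1.1], [1.5], [3.6], [10.0]])
actual = torch.tensor([[1.0], [2.0], [3.0], [8.0]])

error_rates = torch.abs((predicted - actual) / actual)   # one rate per property
mer = torch.median(error_rates).item()

print(error_rates.squeeze().tolist())
print(mer)
```

Two details are worth keeping in mind. The error rate is a ratio, so it is the same whether prices are expressed in dollars or in normalized units. Also, for an even number of elements `torch.median` returns the lower of the two middle values, whereas `np.median` averages them, so the two can differ slightly on small samples.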

In [None]:
#TODO: Write code to train the network
def train(model, train_loader, num_epochs, learning_rate=0.001):
    train_losses = []                                       # To store training losses
    train_errors = []                                       # To store training error rates
    # Initialize the optimizer, we use Adam here
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):
        tot_train_loss = 0.0                            # Store the total training MSE loss for this epoch
        train_error_rates =torch.tensor([]).to(device)  # Store the training error rates for this epoch
        for features, labels in train_loader:           # For each batch in DataLoader
            # Move data to device (GPU or CPU)
            features = features.to(device)
            labels = labels.to(device)

            output = model(features)  # Compute the output
            loss = criterion(output, labels)  # Compute loss
            optimizer.zero_grad()  # Clear the gradients from previous iteration
            loss.backward()  # Backpropagate the error
            optimizer.step()  # Update the weights

            tot_train_loss += loss.item()

            # Compute and store error rate for training data
            # (detach so the stored rates do not keep the autograd graph alive)
            train_error_rate = torch.absolute(output.detach()/labels-1)
            train_error_rates= torch.cat((train_error_rates,train_error_rate),dim=0)


        train_loss = tot_train_loss /len(train_loader)
        train_error = torch.median(train_error_rates).item()
        train_losses.append(train_loss)
        train_errors.append(train_error)
        # Print the training loss and median error rate every 10 epochs
        if epoch % 10 == 0:
            print("Epoch: {}/{}.. ".format(epoch+1, num_epochs),
              "Train Loss: {:.3f}.. ".format(train_loss),
              "Train Median Error Rate: {:.3f}.. ".format(train_error),
              )

    return  train_losses,train_errors

In [None]:
train_losses, train_errors = train(lin_net, train_loader, num_epochs=500, learning_rate=0.001)

Plot the training error (MER) over epochs

In [None]:
#TODO: Write code to plot the training error (MER) over epochs
import matplotlib.pyplot as plt
def plot_errors(train_errors, num_epochs):
    plt.figure(figsize=(10, 7))
    plt.plot(range(1, num_epochs + 1), train_errors, label='train')
    plt.xlabel('epoch')
    plt.ylabel('median error rate')
    plt.legend()
    plt.show()

In [None]:
plot_errors(train_errors, num_epochs=500)

---
## Build the Multi-layer Perceptron Base Model

In the following, we build a multi-layer perceptron model.

In [None]:
#TODO: Build a multi-layer perceptron neural network with 2 hidden layers of sizes 256 and 128, respectively, and ReLU activations
import torch.nn as nn
model = nn.Sequential(nn.Linear(train_features.shape[1], 256),
                      nn.ReLU(),
                      nn.Linear(256, 128),
                      nn.ReLU(),
                      nn.Linear(128,1))
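
As a quick shape check, the same architecture can be rebuilt standalone to count its parameters and confirm the output shape. The input size of 12 is an assumption based on the 12 numeric features left after data preparation, and the batch is random.

```python
import torch
import torch.nn as nn

num_features = 12   # assumption: the 12 numeric features left after data preparation

mlp = nn.Sequential(
    nn.Linear(num_features, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Weights plus biases per layer: 12*256+256, 256*128+128, 128*1+1
num_params = sum(p.numel() for p in mlp.parameters())
out = mlp(torch.randn(4, num_features))   # toy batch of 4 samples

print(num_params)
print(out.shape)
```

Under that assumption the network has 36,353 trainable parameters, compared with 13 for the linear regression benchmark.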

In [None]:
model=model.to(device)

In [None]:
#TODO: write code to train the MLP network
train_losses, train_errors = train(model, train_loader, num_epochs=500, learning_rate=0.001)

In [None]:
#TODO: Write code to plot the training error (MER) over epochs
plot_errors(train_errors, num_epochs=500)

**Question 1**: What are your final training errors of the multilayer perceptron model and the linear regression model?

We can see that the training error of the linear regression model is around 22% and the training error of the multilayer perceptron model is around 10%.

---
## Inference on test data

After the MLP model is trained, we can use it for inference.

We need to remember to set the model to inference mode with `model.eval()`. You'll also want to turn off autograd with the `torch.no_grad()` context manager.
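
The effect of the two calls can be seen on a small toy model (the layer sizes below are arbitrary and unrelated to the house price model):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))  # toy model
x = torch.randn(2, 3)

net.eval()                 # layers such as dropout or batch norm switch to inference behavior
with torch.no_grad():      # no computation graph is recorded inside this block
    out_no_grad = net(x)

out_with_grad = net(x)     # outside the block, autograd tracks the computation

print(out_no_grad.requires_grad)
print(out_with_grad.requires_grad)
print(net.training)
```

Note that `eval()` and `no_grad()` do different jobs: the first changes layer behavior, the second only disables gradient tracking, which saves memory during inference.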

In [None]:
#TODO: write the code to generate predicted sale prices for test data
# Test our network for one batch
model.eval()
dataiter = iter(test_loader)
features,= next(dataiter)
features=features.to(device)
with torch.no_grad():  # no gradients are needed for inference
    output = model(features)

In [None]:
def test(model, test_loader):
    model.eval()  # Set the model to evaluation mode
    # In evaluation (or testing) mode, we don't want any parameter updates. We want the model to give
    # the final output based on current parameter values.
    pred_labels = torch.tensor([]).to(device)
    with torch.no_grad():
        for features, in test_loader:
            # Move data to device (GPU or CPU)
            features = features.to(device)
            output = model(features)  # Compute the output
            # store prediction results
            pred_labels = torch.cat( (pred_labels,output),dim=0)
    return  pred_labels

In [None]:
pred_labels =test(model,test_loader)

Remember that when we predict the prices, we need to multiply by the normalization factor again.

In [None]:
pred_labels=pred_labels*normalization_factor

In [None]:
pred_labels[:20]

In [None]:
data = {'ID': test_ID, 'SALE_PRICE': pred_labels.cpu().numpy().squeeze()}
pred_df=pd.DataFrame(data)
pred_df.head()

Save the dataframe to a CSV file without the index by setting `index=False`.

In [None]:
#TODO: save the predicted sale prices into submission.csv
pred_df.to_csv('submission.csv', index=False)

Now, we can submit our predictions on Kaggle and see how they compare with the actual house prices (labels) on the test set.

- Log in to the Kaggle website and visit the house price prediction competition page.

- Click the “Submit Predictions” button.

- Click the “Upload Submission File” button in the dashed box at the bottom of the page and select the prediction file you wish to upload.

- Click the “Make Submission” button at the bottom of the page to view your results.

**Question 2**: What is the test error shown on Kaggle? How does it compare with the train error?

It turns out that the test error is around 15%, which is much higher than the train error.

---
## Evaluate test error

**Note**: To evaluate the test error, you need access to the true sale prices for the test data. You do not need to run the following code! Instead, you will get the test error by submitting your predictions on Kaggle as described above.

In [None]:
df_sol=pd.read_csv('solution.csv')
df_sol.head()


In [None]:
test_labels=df_sol['SALE_PRICE'].values

In [None]:
test_labels.shape

In [None]:
# Compute the signed relative errors for the test data
test_errors=pred_labels.cpu().numpy().squeeze()/test_labels-1
test_errors.shape

In [None]:
median_test_error=np.median(np.absolute(test_errors)).item()

In [None]:
median_test_error

## Conclusion

We see that the median test error is around 15%. In Part II of Case 1, you will be asked to vary model architecture or optimization algorithms to see if you can squeeze out a lower test error.