# House price prediction for Denver Dataset

In this case (Part II), there are two main tasks:

- First, you will need to split the train data into training and validation sets and tune the model hyperparameters to choose the best model. Submit the predicted prices for the test data to Kaggle to compete for Prof. X's Prize!

- Then you will examine the profit of the iBuyer business model based on the predicted prices on the validation data.






You need to submit a report in pdf format containing the following material on canvas site:

1.   A plot of the training errors and validation errors over epochs for a base multilayer perceptron model with 2 hidden layers of sizes 256 and 128.

2.   A plot of the training errors and validation errors over epochs for a  multilayer perceptron model with 4 hidden layers of sizes 512, 256, 128, 64.

3.  A plot of the training errors and validation errors over epochs for a  multilayer perceptron model with 4 hidden layers of sizes 512, 256, 128, 64
and norm regularization.

4. A plot of the training errors and validation errors over epochs for a  multilayer perceptron model with 4 hidden layers of sizes 512, 256, 128, 64
and norm regularization and dropout layers.

5. A table listing all the model hyperparameters that you have tried with the corresponding validation errors that you found.

6. Your profit analysis of the iBuyer business model based on the predicted prices on the validation data and answers to the four questions therein.


You also need to submit on canvas site:

- The complete Jupyter notebook containing all your PyTorch code with explanations, along with Markdown text explaining different parts if needed.
- A checkpoint.pth file containing all the necessary information to retrieve your best model and predictions.
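
One possible way to package such a checkpoint is sketched below. It uses a small stand-in network, and the dictionary keys (`model_state_dict`, `input_size`, `hidden_sizes`, `valid_error`) are illustrative choices, not a required layout.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; replace with your best model and its real input size.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))

# Hypothetical checkpoint layout: the weights plus what is needed to rebuild the network.
checkpoint = {
    "model_state_dict": model.state_dict(),
    "input_size": 8,
    "hidden_sizes": [4],
    "valid_error": None,  # placeholder; store your measured validation error here
}

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(checkpoint, path)

# Restore: rebuild the same architecture, then load the saved weights.
loaded = torch.load(path, map_location="cpu")
restored = nn.Sequential(
    nn.Linear(loaded["input_size"], loaded["hidden_sizes"][0]),
    nn.ReLU(),
    nn.Linear(loaded["hidden_sizes"][0], 1),
)
restored.load_state_dict(loaded["model_state_dict"])
restored.eval()
```

Saving the `state_dict` together with the architecture sizes means the file can be reloaded without the original notebook session.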



---
## Data Loading and Visualize Data

The train data and test data are available on Kaggle website.
You can first download them, then upload them to the google colab, and then read the data using pandas.

In [None]:
import pandas as pd  # Importing pandas, which is a library for data manipulation and analysis
#TODO: Read the datasets
train_df = pd.read_csv("/kaggle/input/prof-xs-prize-house-price-prediction-2025/train.csv")
test_df = pd.read_csv("/kaggle/input/prof-xs-prize-house-price-prediction-2025/test.csv")

In [None]:
# Display the train dataframe
print(train_df.shape)
print(train_df.columns)

In [None]:
# Display the test dataframe
print(test_df.shape)
print(test_df.columns)

As you can see, we have 11581 training samples and 4964 test samples, each with 16 features. The training samples contain the sale prices, which are the labels. The test samples do not contain the sale prices, which we will predict by building an MLP model.


### Visualization of SALE PRICES

Let's take a closer look at the sale prices in the train data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  # Importing matplotlib's pyplot, it provides a MATLAB-like interface for making plots and charts

# Set the style
sns.set(style="whitegrid")

# Create a histogram
plt.figure(figsize=(10, 6))
sns.histplot(train_df['SALE_PRICE'], bins=50, color='blue')
plt.title('Histogram of Sale Prices (Train Data)')
plt.xlabel('Sale Price')
plt.ylabel('Number of Properties')
plt.show()

Check the minimum, maximum, and median sale_price in train data.

In [None]:
print(train_df['SALE_PRICE'].min())
print(train_df['SALE_PRICE'].max())
print(train_df['SALE_PRICE'].median())

We see that the sale_price has a wide range from 50K to 2 million, with the median price 431K.

### Visualization of Correlation

We can also compute and visualize the correlation matrix.

In [None]:
# Compute the correlation matrix:
# correlation_matrix = train_df.corr()
correlation_matrix = train_df.select_dtypes(include=['number']).corr()

# 1. Increase the figure size for clarity
plt.figure(figsize=(8, 8))

# 2. Use a heatmap with annotations, a color map, and specific formatting for the annotations
ax = sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={'size': 8})

# 3. Rotate the x-axis labels for better visibility
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Rotate the y-axis labels
plt.setp(ax.get_yticklabels(), rotation=0)

# 4. Title and display
plt.title('Correlation Heatmap')
plt.tight_layout()  # This can help if any labels are still being cut off
plt.show()

# 5. Optionally save the figure with high resolution
# plt.savefig("heatmap.png", dpi=300)

We can see that the Sale_Price has high correlation with Living_SQFT and number of Full Bathrooms.

### Distribution of houses over different NBHD

In [None]:
# Compute the number of houses per neighborhood
House_by_NBHD = train_df['NBHD'].value_counts()
print(House_by_NBHD )

In [None]:
# Keep only neighborhoods with more than 200 houses
filtered_House_by_NBHD = House_by_NBHD[House_by_NBHD > 200]
filtered_House_by_NBHD.plot(kind='bar', figsize=(10,6))
plt.title('Distribution of Houses by NBHD')
plt.ylabel('Number of Houses')
plt.xlabel('NBHD')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability, if necessary
plt.tight_layout()  # Ensure everything fits without overlapping
plt.show()


---
## Data Preparation

The first step when building a neural network model is getting your data into the proper form to feed into the network.

- Train labels: We need to extract the sale prices from the train data as train labels. Since the house prices can take very large values, to make training fast it is helpful to define the train labels as the sale prices divided by a normalization factor.

- **Handling non-numeric features**: Some of the house features are non-numeric. We will learn about how to process categorical data in the upcoming lectures. For now, you can remove those non-numeric features and only train over the numeric features.

- **Feature standardization**: When predicting house prices, you started from features that took a variety of ranges—some features had small floating-point values, and others had fairly large integer values. The model might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature in the input data (a column in the input dataframe), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has
a unit standard deviation. Note that here we combine the feature vectors in the train and test data. In this way, the train and test data go through the same normalization.

- **Handling missing values**: There may exist some entries with missing values. After the feature standardization, we can impute the missing values with zeros.
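
The standardization and imputation steps above can be sketched on a toy frame. The values below are made up for illustration; only the column names come from this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined train and test features.
toy = pd.DataFrame({
    "LIVING_SQFT": [1000.0, 1500.0, np.nan, 2500.0],
    "BED_RMS": [2.0, 3.0, 3.0, 4.0],
})

# Column-wise standardization; pandas skips NaN when computing mean and std.
standardized = (toy - toy.mean()) / toy.std()

# After centering, 0 is the column mean, so filling NaN with 0 is mean imputation.
standardized = standardized.fillna(0)
print(standardized)
```

Imputing after standardization is what makes zero a sensible fill value: it places the missing entry at the feature's average.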

In [None]:
test_df.info()

In [None]:
train_df.info()

We see that the sale_price in train data has a wide range from 50K to 2 million, with the median price 431K. We can divide the sale_price by 100K, so the normalized sale_price is between 0.5 and 20 in training data. Remember, when we output the predicted price for the test data, we need to multiply back the normalization factor.
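
As a quick check of the scaling, the sketch below round-trips the three prices quoted above. The log step mirrors the transform this notebook applies to the labels when converting them to tensors; treat it as an illustration, not the required pipeline.

```python
import numpy as np

normalization_factor = 100_000
prices = np.array([50_000.0, 431_000.0, 2_000_000.0])  # min, median, max quoted above

labels = prices / normalization_factor   # scaled labels, between 0.5 and 20
log_labels = np.log(labels)              # log transform of the scaled labels

# Inverse: undo the log first, then multiply back the normalization factor.
recovered = np.exp(log_labels) * normalization_factor
print(labels, recovered)
```

Whatever transform is applied to the labels has to be inverted in the reverse order when producing the final predicted prices.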

In [None]:
#TODO: define labels for train data
normalization_factor=100000
train_labels = train_df['SALE_PRICE']/normalization_factor
train_df.drop('SALE_PRICE', axis=1, inplace=True) # drop the sale_prices in features.

The `inplace` parameter, when set to `True`, modifies the DataFrame directly and returns `None` instead of a new DataFrame. Dropping a column does not reorder the remaining rows, so `train_labels` stays aligned with the rows of `train_df`.

Note that both the training samples and test samples contain an ID column, which is not informative for predicting the house price. Thus we will drop the ID column.

In [None]:
train_ID=train_df['ID']
test_ID=test_df['ID']
train_df.drop('ID', axis=1, inplace=True)
test_df.drop('ID', axis=1, inplace=True)

**Feature Engineering**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns



def make_features(df):

    df = df.copy()

    df['UNITS'] = df['UNITS'].fillna(1)
    df['IS_MULTI_UNIT'] = (df['UNITS'] > 1).astype(int)

    # TOTAL_BATH
    df['FULL_B'] = df['FULL_B'].fillna(0)
    df['HLF_B'] = df['HLF_B'].fillna(0)
    df['TOTAL_BATH'] = df['FULL_B'] + 0.5 * df['HLF_B']

    # AREA-based

    eps = 1e-6
    df['GRD_AREA'] = df['GRD_AREA'].fillna(0)
    df['BED_RMS'] = df['BED_RMS'].fillna(0)
    df['STORY'] = df['STORY'].fillna(1)

    df['AREA_PER_BED'] = df['GRD_AREA'] / (df['BED_RMS'] + eps)
    df['AREA_PER_BATH'] = df['GRD_AREA'] / (df['TOTAL_BATH'] + eps)
    df['AREA_PER_STORY'] = df['GRD_AREA'] / (df['STORY'] + eps)

    # Age related
    df['BLDG_AGE'] = df['BLDG_AGE'].fillna(-1)
    df['RM_AGE'] = df['RM_AGE'].fillna(-1)
    df['AGE_DIFF'] = df['BLDG_AGE'].replace(-1, np.nan) - df['RM_AGE'].replace(-1, np.nan)
    # Flags
    df['IS_NEW_BUILD'] = ((df['BLDG_AGE'] >= 0) & (df['BLDG_AGE'] <= 5)).astype(int)
    df['IS_RECENT_REM'] = ((df['RM_AGE'] >= 0) & (df['RM_AGE'] <= 3)).astype(int)


    df['LAND_SQFT'] = df['LAND_SQFT'].fillna(0)
    df['FAR'] = df['GRD_AREA'] / (df['LAND_SQFT'] + eps)

    # Basement / living spaces

    for col in ['LIVING_SQFT', 'FBSMT_SQFT', 'BSMT_AREA']:
        if col not in df.columns:
            df[col] = 0
    df['LIVING_SQFT'] = df['LIVING_SQFT'].fillna(0)
    df['FBSMT_SQFT'] = df['FBSMT_SQFT'].fillna(0)
    df['BSMT_AREA'] = df['BSMT_AREA'].fillna(0)


    df['LIVING_TO_GRD_RATIO'] = df['LIVING_SQFT'] / (df['GRD_AREA'] + eps)
    df['LIVING_TO_LAND_RATIO'] = df['LIVING_SQFT'] / (df['LAND_SQFT'] + eps)

    if 'PROP_CLASS' in df.columns:
        top_classes = df['PROP_CLASS'].value_counts().nlargest(10).index
        df['PROP_CLASS_REDUCED'] = df['PROP_CLASS'].where(df['PROP_CLASS'].isin(top_classes), 'OTHER')
    else:
        df['PROP_CLASS_REDUCED'] = 'UNKNOWN'


    df['NBHD'] = df['NBHD'].fillna('UNKNOWN')


    for c in ['GRD_AREA', 'LAND_SQFT', 'BLDG_AGE', 'RM_AGE', 'LIVING_SQFT', 'FBSMT_SQFT', 'BSMT_AREA']:
        new = 'LOG1P_' + c

        if c in ['BLDG_AGE', 'RM_AGE']:
            df[new] = df[c].apply(lambda x: np.log1p(x) if x >= 0 else np.nan)
        else:
            df[new] = np.log1p(df[c])

    return df

In [None]:
train_df = make_features(train_df)
test_df = make_features(test_df)

In [None]:
# Then we combine the feature vectors in the train data and test data
features=pd.concat(objs=[train_df,test_df],axis=0)

In [None]:
features.info()

We see that there are three non-numeric features, namely `NBHD`, `PROP_CLASS`, and `STYLE_CN`. We will apply one-hot encoding to those non-numeric features in our model; you could also simply drop these non-numeric features.
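
A minimal sketch of what one-hot encoding does, using `pd.get_dummies` on a toy frame. The neighborhood labels `"A"` and `"B"` are made up; `dummy_na=True` adds an extra indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical column (with a missing entry) and one numeric column.
toy = pd.DataFrame({
    "NBHD": ["A", "B", np.nan, "A"],
    "LIVING_SQFT": [1000, 1500, 1200, 2500],
})

# Each category, plus NaN, becomes its own indicator column; numeric columns pass through.
encoded = pd.get_dummies(toy, columns=["NBHD"], dummy_na=True)
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one active indicator among the `NBHD_*` columns, which is why the feature dimension grows with the number of distinct categories.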

In [None]:
numeric_features = features.dtypes[features.dtypes != 'object'].index
non_numeric_features = features.dtypes[features.dtypes == 'object'].index
numeric_features, non_numeric_features

In [None]:
# If you want to drop the non-numeric features, you just set drop_non_numeric_features= True.
drop_non_numeric_features= False

if drop_non_numeric_features:
    features= features.drop(non_numeric_features, axis=1)
else:
    # One-hot encode categorical features
    features = pd.get_dummies(features, columns=non_numeric_features, dummy_na=True)


# Check for non-numeric columns
non_numeric_cols = features.select_dtypes(include=['object']).columns
if not non_numeric_cols.empty:
    raise ValueError(f"DataFrame contains non-numeric columns: {non_numeric_cols.tolist()}")

In [None]:
features.info()

In [None]:
# Standardize numeric features
features[numeric_features] = features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))

In [None]:
# recheck the mean and std after standardization
features[numeric_features].mean(), features[numeric_features].std()

We see that after standardization, the numeric features of the combined train and test data have mean 0 and standard deviation 1.

In [None]:
# After the feature standardization, we can impute the missing values with zeros.
features[numeric_features] = features[numeric_features].fillna(0)

Double check the features after data processing.

In [None]:
features.info()
print(features.columns)

Note that after the one-hot encoding of the non-numeric features, we now have a 105-dimensional feature vector.

In [None]:
# check whether there is any missing entry
print(features.isnull().sum())

In [None]:
# we extract out the train and test features
import pandas as pd
train_features = features.iloc[:len(train_labels)]
test_features = features.iloc[len(train_labels):]

train_features.shape, test_features.shape

In [None]:
import torch
import numpy as np

# Convert training features and labels to PyTorch tensors.
# The labels are log-transformed here, so model outputs are log(price / normalization_factor)
# and must be exponentiated before multiplying back the normalization factor.
train_features = torch.tensor(train_features.values.astype(np.float32), dtype=torch.float32)
test_features = torch.tensor(test_features.values.astype(np.float32), dtype=torch.float32)
train_labels = torch.log(torch.tensor(train_labels.values.reshape(-1, 1).astype(np.float32), dtype=torch.float32))

In [None]:
train_labels

In [None]:
train_labels.shape

In [None]:
train_features.shape

---
## Training and Validation

To detect overfitting, we'll split our training data into training and validation sets. We will use the validation set to select the appropriate model.
One way is to use the [`train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). You're more than welcome to use your own way.

In [None]:
#TODO: filling in the missing code to split train data into train and validation
from sklearn.model_selection import train_test_split  # Importing train_test_split function from sklearn for splitting data into training set and validation set
# Splitting the training data: 20% is validation data
# train_indices, valid_indices, train_features, valid_features, train_labels, valid_labels =
train_features, valid_features, train_labels, valid_labels = train_test_split(train_features, train_labels, test_size=0.2, random_state=42)

In [None]:
print(train_features.shape)
print(valid_features.shape)

---
## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets. Note that we will shuffle the train data, so the model will not learn a particular order. For valid and test data, we do not shuffle.

In [None]:
from torch.utils.data import TensorDataset, DataLoader
train_data = TensorDataset(train_features, train_labels)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
#TODO: create dataloader for validation data
valid_data = TensorDataset(valid_features, valid_labels)
valid_loader = DataLoader(valid_data, batch_size = 128, shuffle=False)

In [None]:
test_data = TensorDataset(test_features)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False)

Let's take a batch to have a sanity check

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
features, labels = next(dataiter)

print('Sample input size: ', features.size()) # batch_size, num_features
print('Sample input: \n', features)
print()
print('Sample label size: ', labels.size()) # batch_size, 1
print('Sample label: \n', labels)

---
## Linear Regression as Benchmark

Let us build a linear regression model as a benchmark. Note that the linear regression model can be viewed as a special instance of a multi-layer perceptron with no hidden layer and a single output neuron.
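
To make the "no hidden layer" view concrete, the sketch below checks that a single `nn.Linear` layer computes the affine map `x @ W.T + b`. The sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)   # 3 input features, 1 output neuron
x = torch.randn(5, 3)     # a batch of 5 samples

# Linear regression prediction written out by hand: y = x W^T + b
manual = x @ layer.weight.T + layer.bias

print(torch.allclose(layer(x), manual, atol=1e-6))
```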

In [None]:
# Build a linear regression model network
import torch.nn as nn
lin_net = nn.Linear(train_features.shape[1], 1)

Let's print out the model architecture.

In [None]:
lin_net

Let's take a batch and see the output

In [None]:
features, labels = next(dataiter)
output=lin_net(features)
output.shape,labels.shape

---
## Train the model

First, we will use GPU training if it is available.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
lin_net = lin_net.to(device)

Second, let us specify the loss function.

In [None]:
# Both the output and the label are real-valued log prices; we use the mean absolute error (L1) loss.
criterion = nn.L1Loss()

Third, while we use the L1 loss on log prices as the training loss, we will use
a different metric to measure the predictive performance.

Note that with house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus we tend to care more about the relative error than about the absolute error. For instance, if our prediction is off by \\$100,000 when estimating the sale price of a house which is \\$125,000, then we are probably doing a horrible job. On the other hand, if we err by this amount for a house with sale price \\$2 million, this might represent a pretty  accurate prediction.

To address this issue, we will use the median error rate (MER) used by [Zestimate](https://www.zillow.com/z/zestimate/) to measure the predictive performance. The error rate is defined as
$$
\text{Error Rate} = \left| \frac{\text{Predicted Price}-\text{Actual Price}}{\text{Actual Price}} \right|
$$
The median error rate is defined as the median of error rates for all properties.
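
The definition above can be sketched in a few lines of NumPy. The helper name `median_error_rate` and the sample prices are hypothetical, chosen to echo the $125,000 and $2 million example.

```python
import numpy as np

def median_error_rate(predicted, actual):
    """Median of |predicted - actual| / actual over all properties."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.median(np.abs((predicted - actual) / actual)))

actual = [125_000, 400_000, 2_000_000]
predicted = [225_000, 380_000, 1_900_000]

# Error rates are 0.80, 0.05, and 0.05, so the median is 0.05.
mer = median_error_rate(predicted, actual)
print(mer)
```

Because the median ignores the size of the worst errors, a single badly mispriced house (the 80% miss here) does not dominate the metric.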

We are now ready to train the network. Let’s try training the model a bit longer: 200 epochs. To keep a record of how well the model does at each epoch, we will save the per-epoch training error and validation error in the training loop.

In [None]:
#TODO: Write code to train the network and save training and validation error.
def train(model, train_loader, valid_loader, num_epochs, learning_rate=0.001):
    train_losses = []   # Per-epoch training losses
    train_errors = []   # Per-epoch training median error rates
    valid_losses = []   # Per-epoch validation losses
    valid_errors = []   # Per-epoch validation median error rates
    # Initialize the optimizer, we use Adam here
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):
        model.train()                                    # Training mode (matters once dropout is used)
        tot_train_loss = 0.0                             # Total training loss for this epoch
        train_error_rates = torch.tensor([]).to(device)  # Training error rates for this epoch
        for features, labels in train_loader:            # For each batch in DataLoader
            # Move data to device (GPU or CPU)
            features = features.to(device)
            labels = labels.to(device)

            output = model(features)  # Compute the output
            loss = criterion(output, labels)  # Compute loss
            optimizer.zero_grad()  # Clear the gradients from previous iteration
            loss.backward()  # Backpropagate the error
            optimizer.step()  # Update the weights

            tot_train_loss += loss.item()

            # Labels are log prices, so exp(output)/exp(labels) is predicted/actual price.
            # detach() keeps the stored error rates from holding on to the autograd graph.
            train_error_rate = torch.absolute(torch.exp(output.detach())/torch.exp(labels)-1)
            train_error_rates = torch.cat((train_error_rates, train_error_rate), dim=0)

        # Evaluate on the validation set without tracking gradients
        model.eval()
        tot_valid_loss = 0.0
        valid_error_rates = torch.tensor([]).to(device)
        with torch.no_grad():
            for features, labels in valid_loader:
                features = features.to(device)
                labels = labels.to(device)

                output = model(features)
                loss = criterion(output, labels)
                tot_valid_loss += loss.item()

                valid_error_rate = torch.absolute(torch.exp(output)/torch.exp(labels)-1)
                valid_error_rates = torch.cat((valid_error_rates, valid_error_rate), dim=0)

        train_loss = tot_train_loss / len(train_loader)
        train_error = torch.median(train_error_rates).item()
        train_losses.append(train_loss)
        train_errors.append(train_error)

        valid_loss = tot_valid_loss / len(valid_loader)
        valid_error = torch.median(valid_error_rates).item()
        valid_losses.append(valid_loss)
        valid_errors.append(valid_error)
        # Print the training loss and validation loss every 10 epochs
        if epoch % 10 == 0:
            print("Epoch: {}/{}.. ".format(epoch+1, num_epochs),
              "Train Loss: {:.3f}.. ".format(train_loss),
              "Train Median Error Rate: {:.3f}.. ".format(train_error),
              "Validation Loss: {:.3f}.. ".format(valid_loss),
              "Validation Median Error Rate: {:.3f}.. ".format(valid_error),
              )

    return train_losses, train_errors, valid_losses, valid_errors

In [None]:
train_losses, train_errors, valid_losses, valid_errors = train(lin_net, train_loader, valid_loader, num_epochs=200, learning_rate=0.001)

Define a function to plot errors.

In [None]:
#TODO: Write code to plot the training and validation errors (MER) over epochs
import matplotlib.pyplot as plt
def plot_errors(train_errors, valid_errors, num_epochs):
    plt.figure(figsize=(10, 7))
    plt.plot(range(1, num_epochs + 1), train_errors, label='train error')
    plt.plot(range(1, num_epochs + 1), valid_errors, label='validation error')
    plt.xlabel('epoch')
    plt.ylabel('median error rate')
    plt.title('Training vs Validation Error')
    plt.legend()
    plt.show()

In [None]:
plot_errors(train_errors, valid_errors, num_epochs=200)

In [None]:
# #TODO: write the code to generate predicted sale prices for test data
# lin_net.eval()
# lin_net.to(device)

# with torch.no_grad():
#     test_features = test_features.to(device).float()
#     predicted_prices = lin_net(test_features)

# predicted_prices = predicted_prices.cpu().numpy().flatten()

# predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
# #TODO: save the predicted sale prices into submission_csv
# submission = pd.DataFrame({
#     "ID": test_ID,
#     "SALE_PRICE": predicted_prices
# })
# submission.to_csv("submissionbase.csv", index=False)

---
## Build the Multi-layer Perceptron Base Model

In the following, we build a multi-layer perceptron model with 2 hidden layers of sizes 256 and 128, respectively, and ReLU activations.

In [None]:
# Build a feed-forward network
import torch.nn as nn
model = nn.Sequential(nn.Linear(train_features.shape[1], 256),
                      nn.ReLU(),
                      nn.Linear(256, 128),
                      nn.ReLU(),
                      nn.Linear(128,1))

Let's print out the model architecture.

In [None]:
model

In [None]:
model = model.to(device)

In [None]:
#TODO: write code to train the MLP network and save training and validation error.
train_losses, train_errors, valid_losses, valid_errors = train(model, train_loader, valid_loader, num_epochs=200, learning_rate=0.001)

In [None]:
#TODO: Write code to plot the training and validation error (MER) over epochs
plot_errors(train_errors, valid_errors, num_epochs=200)

In [None]:
#TODO: write the code to generate predicted sale prices for test data
# model.eval()
# model.to(device)

# with torch.no_grad():
#     test_features = test_features.to(device).float()
#     predicted_prices = model(test_features)

# predicted_prices = predicted_prices.cpu().numpy().flatten()

# predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
#TODO: save the predicted sale prices into submission_csv
# submission = pd.DataFrame({
#     "ID": test_ID,
#     "SALE_PRICE": predicted_prices
# })
# submission.to_csv("submission2layer.csv", index=False)

---
## Change network architecture


In the following, build an MLP with 4 hidden layers of sizes 512, 256, 128, 64, respectively.

In [None]:
del model

In [None]:
#TODO: build an MLP with 4 hidden layers of sizes 512, 256, 128, 64
model = nn.Sequential(nn.Linear(train_features.shape[1], 512),
                      nn.ReLU(),
                      nn.Linear(512, 256),
                      nn.ReLU(),
                      nn.Linear(256, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64,1))

In [None]:
model = model.to(device)

In [None]:
# #TODO: plot the training and validation error (MER) over epochs
train_losses, train_errors, valid_losses, valid_errors = train(model, train_loader, valid_loader, num_epochs=200, learning_rate=0.001)
plot_errors(train_errors, valid_errors, num_epochs=200)

In [None]:
# #TODO: write the code to generate predicted sale prices for test data
# model.eval()
# model.to(device)

# with torch.no_grad():
#     test_features = test_features.to(device).float()
#     predicted_prices = model(test_features)

# predicted_prices = predicted_prices.cpu().numpy().flatten()

# predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
# #TODO: save the predicted sale prices into submission_csv
# submission = pd.DataFrame({
#     "ID": test_ID,
#     "SALE_PRICE": predicted_prices
# })
# submission.to_csv("submission5layer.csv", index=False)

---
## Add norm regularization

In the following, use norm (L2) regularization, applied through the optimizer's `weight_decay` argument, to retrain the above MLP.
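
To see what `weight_decay` does, the sketch below isolates the penalty term. It uses plain SGD instead of the Adam optimizer used in this notebook, because the SGD update is easy to verify by hand; with Adam the same penalty gradient is added and then rescaled by the adaptive step. The learning rate and decay values are arbitrary.

```python
import torch

lr, wd = 0.1, 0.01
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([w], lr=lr, weight_decay=wd)

w_before = w.detach().clone()
loss = (w * 0.0).sum()   # data loss with zero gradient, so only the penalty acts
loss.backward()
optimizer.step()

# weight_decay adds wd * w to the gradient, shrinking each weight toward zero.
expected = w_before - lr * wd * w_before
print(w.detach(), expected)
```

Each step multiplies the weights by `1 - lr * wd`, which is the gradient step for an L2 penalty of `(wd / 2) * ||w||^2`.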

In [None]:
del model

In [None]:
model = nn.Sequential(nn.Linear(train_features.shape[1], 512),
                      nn.ReLU(),
                      nn.Linear(512, 256),
                      nn.ReLU(),
                      nn.Linear(256, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64,1))
model.to(device)

In [None]:
#TODO: Write code to train the network
def trainweightdecay(model, train_loader, valid_loader, num_epochs, learning_rate=0.001, weight_decay=0.0001):
    train_losses = []                                       # To store training losses
    train_errors = []                                       # To store training error rates
    valid_losses = []                                       # To store validation losses
    valid_errors = []                                       # To store validation error rates
    # Initialize the optimizer, we use Adam with weight decay (L2 penalty) here
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    for epoch in range(num_epochs):
        tot_train_loss = 0.0                            # Store the total training loss for this epoch
        train_error_rates =torch.tensor([]).to(device)  # Store the training error rates for this epoch
        for features, labels in train_loader:           # For each batch in DataLoader
            # Move data to device (GPU or CPU)
            features = features.to(device)
            labels = labels.to(device)

            output = model(features)  # Compute the output
            loss = criterion(output, labels)  # Compute loss
            optimizer.zero_grad()  # Clear the gradients from previous iteration
            loss.backward()  # Backpropagate the error
            optimizer.step()  # Update the weights

            tot_train_loss += loss.item()

            # Compute and store error rate for training data;
            # detach() keeps the stored error rates from holding on to the autograd graph
            train_error_rate = torch.absolute(torch.exp(output.detach())/torch.exp(labels)-1)
            train_error_rates= torch.cat((train_error_rates,train_error_rate),dim=0)

        else:
          tot_valid_loss = 0.0
          valid_error_rates =torch.tensor([]).to(device)
          with torch.no_grad():
            model.eval()
            for features, labels in valid_loader:
              features = features.to(device)
              labels = labels.to(device)

              output = model(features)
              loss = criterion(output, labels)
              tot_valid_loss += loss.item()

              valid_error_rate = torch.absolute(torch.exp(output)/torch.exp(labels)-1)
              valid_error_rates= torch.cat((valid_error_rates,valid_error_rate),dim=0)


        model.train()
        train_loss = tot_train_loss /len(train_loader)
        train_error = torch.median(train_error_rates).item()
        train_losses.append(train_loss)
        train_errors.append(train_error)

        valid_loss = tot_valid_loss /len(valid_loader)
        valid_error = torch.median(valid_error_rates).item()
        valid_losses.append(valid_loss)
        valid_errors.append(valid_error)
        # Print the training loss and validation loss every 10 epochs
        if epoch % 10 == 0:
            print("Epoch: {}/{}.. ".format(epoch+1, num_epochs),
              "Train Loss: {:.3f}.. ".format(train_loss),
              "Train Median Error Rate: {:.3f}.. ".format(train_error),
              "Validation Loss: {:.3f}.. ".format(valid_loss),
              "Validation Median Error Rate: {:.3f}.. ".format(valid_error),
              )

    return  train_losses,train_errors,valid_losses,valid_errors

In [None]:
#TODO: plot the training and validation error (MER) over epochs after using norm regularization
train_losses, train_errors, valid_losses, valid_errors = trainweightdecay(model, train_loader, valid_loader, num_epochs=200)
plot_errors(train_errors, valid_errors, num_epochs=200)

In [None]:
# #TODO: write the code to generate predicted sale prices for test data
# model.eval()
# model.to(device)

# with torch.no_grad():
#     test_features = test_features.to(device).float()
#     predicted_prices = model(test_features)

# predicted_prices = predicted_prices.cpu().numpy().flatten()

# predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
# #TODO: save the predicted sale prices into submission_csv
# submission = pd.DataFrame({
#     "ID": test_ID,
#     "SALE_PRICE": predicted_prices
# })
# submission.to_csv("submission5layerregularization.csv", index=False)

---
## Add dropout layer

In the following, add dropout layers to the above MLP.
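
Dropout behaves differently in training and evaluation mode, which is why the training loop switches between `model.train()` and `model.eval()`. A minimal sketch with a standalone `nn.Dropout` layer on an all-ones input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # entries are zeroed with probability 0.2; survivors are scaled by 1/(1-0.2)

dropout.eval()
eval_out = dropout(x)    # in eval mode dropout is the identity

print(train_out.unique(), eval_out.unique())
```

The rescaling of surviving activations keeps the expected activation the same in both modes, so no extra correction is needed at prediction time.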


In [None]:
del model

In [None]:
#TODO: define the MLP with dropout layers (it is trained and plotted in the next cell)
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(train_features.shape[1], 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64,1)
        self.dropout=nn.Dropout(p=0.2)

    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)

        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))
        x = self.dropout(F.relu(self.fc4(x)))
        x = self.fc5(x)

        return x
model = Classifier().to(device)

In [None]:
model = Classifier().to(device)
train_losses, train_errors, valid_losses, valid_errors = trainweightdecay(model, train_loader, valid_loader, learning_rate=0.0001, num_epochs=380)
plot_errors(train_errors, valid_errors, num_epochs=380)

## Some other model variations

**You're more than welcome to try some other model variations (e.g. different number of hidden layers, hidden neurons, learning rate, etc.) to achieve lower validation error. Include a table listing all the model hyperparameters that you have tried with the corresponding validation errors that you found.**

**The version below modifies the original trainweightdecay function. It replaces Adam with AdamW, which applies weight decay decoupled from the gradient update, and adds a OneCycleLR scheduler that ramps the learning rate up early in training and then anneals it. The aim is to converge faster toward a model that balances underfitting and overfitting.**
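
The shape of the OneCycleLR schedule can be inspected without training anything. The sketch below uses a dummy parameter and an illustrative `total_steps=100`; in real training the total is `num_epochs * len(train_loader)` and `scheduler.step()` is called once per batch.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

param = torch.nn.Parameter(torch.zeros(1))   # dummy parameter, no real model needed
optimizer = torch.optim.AdamW([param], lr=1e-3, weight_decay=1e-5)

total_steps = 100  # illustrative value only
scheduler = OneCycleLR(optimizer, max_lr=7e-3, total_steps=total_steps, pct_start=0.1)

lrs = []
for _ in range(total_steps):
    lrs.append(scheduler.get_last_lr()[0])   # learning rate used for this step
    optimizer.step()                         # normally preceded by loss.backward()
    scheduler.step()                         # advance the schedule once per batch

peak_step = lrs.index(max(lrs))
print(lrs[0], max(lrs), lrs[-1], peak_step)
```

With `pct_start=0.1` the learning rate peaks about a tenth of the way through training and then decays far below its starting value. Note that the scheduler overrides the `lr` passed to the optimizer.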

In [None]:
from torch.optim.lr_scheduler import OneCycleLR

model = Classifier().to(device)

def trainweightdecay(model, train_loader, valid_loader, num_epochs, learning_rate=1e-3, weight_decay=1e-5):
    # Note: criterion and device are read from the notebook's global scope.
    train_losses, train_errors = [], []   # Per-epoch training loss and median error rate
    valid_losses, valid_errors = [], []   # Per-epoch validation loss and median error rate
    # AdamW applies decoupled weight decay. OneCycleLR overwrites the optimizer's
    # learning rate, so learning_rate only matters if the scheduler is removed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    steps_per_epoch = len(train_loader)
    scheduler = OneCycleLR(optimizer, max_lr=7e-3, total_steps=num_epochs * steps_per_epoch, pct_start=0.1)

    for epoch in range(num_epochs):
        model.train()
        tot_train_loss = 0.0                               # Total training MSE loss for this epoch
        train_error_rates = torch.tensor([]).to(device)    # Training error rates for this epoch
        for features, labels in train_loader:              # For each batch in DataLoader
            # Move data to device (GPU or CPU)
            features = features.to(device)
            labels = labels.to(device)
            output = model(features)           # Compute the output
            loss = criterion(output, labels)   # Compute loss
            optimizer.zero_grad()              # Clear the gradients from previous iteration
            loss.backward()                    # Backpropagate the error
            optimizer.step()                   # Update the weights
            scheduler.step()                   # Advance the one-cycle schedule (once per batch)
            tot_train_loss += loss.item()
            # Outputs and labels are in log space, so exp() recovers the price ratio.
            # detach() keeps the autograd graph from being stored with the error rates.
            train_error_rate = torch.absolute(torch.exp(output.detach()) / torch.exp(labels) - 1)
            train_error_rates = torch.cat((train_error_rates, train_error_rate), dim=0)

        # Evaluate on the validation data with dropout disabled and no gradients
        tot_valid_loss = 0.0
        valid_error_rates = torch.tensor([]).to(device)
        model.eval()
        with torch.no_grad():
            for features, labels in valid_loader:
                features = features.to(device)
                labels = labels.to(device)

                output = model(features)
                loss = criterion(output, labels)
                tot_valid_loss += loss.item()

                valid_error_rate = torch.absolute(torch.exp(output) / torch.exp(labels) - 1)
                valid_error_rates = torch.cat((valid_error_rates, valid_error_rate), dim=0)

        train_loss = tot_train_loss / len(train_loader)
        train_error = torch.median(train_error_rates).item()
        train_losses.append(train_loss)
        train_errors.append(train_error)
        valid_loss = tot_valid_loss / len(valid_loader)
        valid_error = torch.median(valid_error_rates).item()
        valid_losses.append(valid_loss)
        valid_errors.append(valid_error)
        # Print the training loss and validation loss every 10 epochs
        if epoch % 10 == 0:
            print("Epoch: {}/{}.. ".format(epoch + 1, num_epochs),
                  "Train Loss: {:.3f}.. ".format(train_loss),
                  "Train Median Error Rate: {:.3f}.. ".format(train_error),
                  "Validation Loss: {:.3f}.. ".format(valid_loss),
                  "Validation Median Error Rate: {:.3f}.. ".format(valid_error),
                  )
    return train_losses, train_errors, valid_losses, valid_errors

In [None]:
train_losses, train_errors, valid_losses, valid_errors = trainweightdecay(model, train_loader, valid_loader, learning_rate=0.0001, num_epochs=140)
plot_errors(train_errors, valid_errors, num_epochs=140)

In [None]:
# #TODO: write the code to generate predicted sale prices for test data
# model.eval()
# model.to(device)

# with torch.no_grad():
#     test_features = test_features.to(device).float()
#     predicted_prices = model(test_features)

# predicted_prices = predicted_prices.cpu().numpy().flatten()

# predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
# #TODO: save the predicted sale prices into submission_csv
# submission = pd.DataFrame({
#     "ID": test_ID,
#     "SALE_PRICE": predicted_prices
# })
# submission.to_csv("submissionlayerregularizationdropout0.0001round2.csv", index=False)

**Takeaways**

**Model generalization is crucial.**
At the beginning of training, we were overly focused on achieving a new record low in validation error and neglected to compare the trends of the training and validation losses. A good sign of generalization is a validation loss that tracks the training loss closely: a widening gap between the two indicates overfitting, while both losses staying high indicates underfitting.

**Feature engineering matters a lot.**
It's highly recommended to plot histograms of numerical and categorical features before feature engineering, so that transformations can be made with a clearer understanding of the data distribution. The dependent variable can benefit from transformations just as the independent variables can: for instance, applying a log transform to a skewed target (as in this case) can stabilize training. Remember to inverse-transform predictions before reporting results.

**Parameter tuning can be automated.**
Using tools like grid search or automated hyperparameter optimization saves a lot of manual effort. Without them, tuning feels like alchemy: time-consuming and exhausting.

**Ensemble learning boosts performance.**
To further improve generalization, another effective strategy borrows the idea behind random forests: ensemble learning through model averaging. By taking the best five individual models and averaging their predictions, we can achieve a more stable and accurate result. This ensemble approach increases the chance of strong leaderboard performance; in fact, it was the key to our final winning solution on Kaggle.
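The averaging step can be sketched as follows. This is a minimal illustration, not the exact code behind our submission: the three arrays are made-up stand-ins for the log-price predictions of individual models, `ensemble_log_predictions` is a hypothetical helper, and the notebook's `normalization_factor` is left out for brevity. Averaging in log space corresponds to taking the geometric mean of the predicted prices.

```python
import numpy as np

def ensemble_log_predictions(log_preds_per_model):
    """Average log-price predictions across models (rows = models, columns = houses)."""
    stacked = np.stack(log_preds_per_model, axis=0)
    return stacked.mean(axis=0)

# Made-up log-price predictions from three models for four houses (illustration only)
model_a = np.log(np.array([400_000.0, 520_000.0, 310_000.0, 650_000.0]))
model_b = np.log(np.array([420_000.0, 500_000.0, 300_000.0, 640_000.0]))
model_c = np.log(np.array([410_000.0, 510_000.0, 320_000.0, 660_000.0]))

avg_log = ensemble_log_predictions([model_a, model_b, model_c])
ensemble_prices = np.exp(avg_log)   # back-transform from log space to prices
print(ensemble_prices.round(0))
```

Each ensemble price lies between the lowest and highest individual prediction for that house, which is why averaging tends to dampen the errors of any single model.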

**Hardware matters too.**
If the budget allows, invest in a decent GPU. Training deep models on CPU-only machines is painfully slow. I learned this the hard way since my setup doesn't support CUDA.

**Thank you**

**Collaborators: Yanze (Ethan) Liu, Jiongyang (July) Song**

![image.png](attachment:63c15f32-e678-4b5a-9fbf-0b90e62c1108.png)

---
## Save Model

Save a `checkpoint.pth` file containing all the necessary information to retrieve your best model and predictions. Remember to submit this file to the Canvas site.

Hint: Check out the `Lecture 4 - Saving and Loading Models.ipynb` on Canvas if you do not know how to save model.

In [None]:
torch.save(model.state_dict(), 'checkpoint.pth')

---
## Inference on test data

After the model is trained, we can use it for inference.

In [None]:
#TODO: write the code to generate predicted sale prices for test data
model.eval()
model.to(device)

with torch.no_grad():
    test_features = test_features.to(device).float()
    predicted_prices = model(test_features)

predicted_prices = predicted_prices.cpu().numpy().flatten()

predicted_prices = np.exp(predicted_prices) * normalization_factor

In [None]:
#TODO: save the predicted sale prices into submission_csv
submission = pd.DataFrame({
    "ID": test_ID,
    "SALE_PRICE": predicted_prices
})
submission.to_csv("submission.csv", index=False)

Now, we can submit our predictions on Kaggle and see how they compare with the actual house prices (labels) on the test set.

- Log in to the Kaggle website and visit the house price prediction competition page.

- Click the “Submit Predictions”.

- Click the “Browse Files” button in the dashed box at the bottom of the page and select the prediction file you wish to upload.

- Click the “Submit” button at the bottom of the page to view your results.

 **Include your best test error shown on Kaggle in your case report!**

---
## Evaluate the profit of iBuyer business model


In class, we have discussed the iBuyer business model and its opportunities and risks. In the following analysis, imagine you work in a consulting firm and would like to investigate the profitability of the iBuyer business model.

You have taken the Modern Analytics course and remember that Prof. X advocated the data-driven approach in business decision making. Thus, you would like to perform the analysis based on model and data.

Note that since we do not know the true sale prices in the future (like test data), we need to conduct the analysis based on the historical data (train or validation data). Previously, you have already trained a multilayer perceptron model using the train data. Now, let's evaluate the profit of iBuyer business model based on the predicted prices on the **validation data**.

Let's first compute the predicted prices on the valid data.

In [None]:
## TODO: Load your best model from your saved checkpoint.pth file.


In [None]:
## TODO: compute the predicted prices on valid data using your best model


Compute the signed error rates (without taking the absolute value sign), that is

$$
\text{Signed Error Rate} = \frac{\text{Predicted Price}-\text{Actual Price}}{\text{Actual Price}}
$$
We will refer to the signed error rate as the prediction error hereafter.
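As a minimal sketch of the formula above, the snippet below computes signed error rates with NumPy, together with their mean (the bias) and the mean absolute percentage error. The price arrays are made-up illustrative values rather than the validation data, and `signed_error_rates` is a hypothetical helper name.

```python
import numpy as np

def signed_error_rates(predicted, actual):
    """(Predicted Price - Actual Price) / Actual Price, element-wise."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return (predicted - actual) / actual

# Made-up prices for four houses (illustration only)
demo_predicted = np.array([330_000.0, 480_000.0, 610_000.0, 250_000.0])
demo_actual = np.array([300_000.0, 500_000.0, 600_000.0, 250_000.0])

demo_errors = signed_error_rates(demo_predicted, demo_actual)
demo_bias = demo_errors.mean()            # positive bias = overestimation on average
demo_mape = np.abs(demo_errors).mean()    # mean absolute percentage error

print("signed errors:", demo_errors.round(4))
print("bias: {:.4f}, MAPE: {:.4f}".format(demo_bias, demo_mape))
```

Keeping the sign matters here: positive and negative errors cancel in the bias but not in the MAPE, so the two numbers answer different questions.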

In [None]:
## TODO: compute the signed error rates (hereafter called prediction errors)



### Analysis and visualization of valid errors

Let's plot the histogram of prediction errors.

**Question 1**: What is the bias of the prediction errors? Include the histogram of prediction errors and the bias in your report.

**The model’s average bias of prediction errors is +3.64%, indicating that it tends to slightly overestimate housing prices. As shown in the histogram of prediction errors, most predictions cluster around zero, suggesting that the model performs reasonably well overall, with only a few large positive outliers.
The model’s Mean Absolute Percentage Error (MAPE) is 11.52%, reflecting a fairly good level of prediction accuracy. Although there is a minor positive bias, the error distribution is relatively concentrated, implying that the model is stable and reliable for practical use.**

### Profit Analysis

In the following profit analysis, we assume the iBuyer will make an offer on every property in the valid data based on its predicted price $PP$. We assume the iBuyer decides the offer price $OP$ according to
$$
OP = \frac{PP}{1+\alpha},
$$
where $\alpha$ is the (targeted) profit margin of the iBuyer.
Here we assume the profit margin has already taken into consideration the commission fee charged by the iBuyer
and various costs associated such as transaction cost, administration cost, and holding cost. Note that the commission fee charged by Zillow is often around $7.5\%$ and Zillow may charge additional repair costs after home inspection. Thus we take $\alpha=12\%$ in this case study.

We further assume that the iBuyer can resell the property at the same
price as the broker in the future once the property is bought. In other words, the resale price is equal to the sale price in the valid data. This assumption may not be exactly true in practice, and the iBuyer may sell the house at either a higher or lower price depending on the market trend. But our conclusion will not change too much.

Based on the above two assumptions, and writing $SP$ for the actual sale price, we can now determine the percentage profit
for a property bought by the iBuyer as
$$
\frac{SP- OP}{OP}.
$$
We use the percentage profit instead of the absolute profit because the iBuyer cannot hope to purchase all houses in the market. Therefore, the percentage profit is a better measure of the profitability of the iBuyer business model.
 The aim of the iBuyer in this simplified setting is to purchase properties for less money than they are sold for, to generate a profit.
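The two formulas above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the solution cell: the prices are made up, `percentage_profit` is a hypothetical helper, and `alpha` plays the role of the target profit margin. Substituting $OP = PP/(1+\alpha)$ into the profit formula gives $(1+\alpha)/(1+e) - 1$, where $e$ is the signed prediction error, so the realized profit equals $\alpha$ only when the prediction is exact.

```python
import numpy as np

def percentage_profit(predicted, sale, alpha):
    """Percentage profit (SP - OP) / OP with offer price OP = PP / (1 + alpha)."""
    offer = predicted / (1.0 + alpha)
    return (sale - offer) / offer

alpha = 0.12  # target profit margin used in this case study

# Made-up predicted prices (PP) and sale prices (SP) for four houses
pp = np.array([330_000.0, 480_000.0, 610_000.0, 250_000.0])
sp = np.array([300_000.0, 500_000.0, 600_000.0, 250_000.0])

profit = percentage_profit(pp, sp, alpha)
errors = (pp - sp) / sp   # signed prediction errors

print("percentage profit:", profit.round(4))
print("mean percentage profit: {:.4f}".format(profit.mean()))
# Same quantity written in terms of the prediction error
print("via errors:", ((1.0 + alpha) / (1.0 + errors) - 1.0).round(4))
```

Overestimated houses (positive error) earn less than the target margin and underestimated ones earn more, which is worth keeping in mind when reading the average.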

In [None]:
profit_margin = 0.12

In [None]:
# ---------- TODO: compute the average percentage profit for iBuyer ----------



**Question 2**: Consider the hypothetical scenario where the offers are all accepted regardless of their values,
what is the average percentage profit? Do you see a big difference compared to the profit margin $\alpha$? Include your answers in the report.

**Using a target profit margin of 12%, the iBuyer’s average realized percentage profit on the validation data is approximately X % (replace X with your output).
This value represents the average return the iBuyer would achieve if it bought and resold every property at the true market price.
If this realized profit is noticeably lower than 12%, it indicates that prediction errors (bias and variance) reduce the effective profitability of the iBuyer model.**

**Offer Acceptance Rule**


However, not every offer will be accepted by the homeowner. Given an offer price, whether the homeowner accepts it
depends on the homeowner's perceived valuation. For the current dataset, we lack enough data to determine the homeowner's perceived valuation of the property. However, the actual sale price in the valid data serves as a reasonable proxy for the homeowner's perceived valuation. Therefore, we assume that the homeowner will accept the offer if
$$
OP> (1-\beta) SP,
$$
where $\beta$ is a discounting factor. Here the discounting factor captures the commission fee charged
by conventional realtors, which is around 6%, as well as the convenience factor that models
how much the homeowner values the quick transaction services of the iBuyer over the conventional
realtor. We assume $\beta = 10\%$ in this case.
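A minimal sketch of the acceptance rule, again on made-up prices and with a hypothetical helper `accepted_mask`. Combining $OP = PP/(1+\alpha)$ with $OP > (1-\beta)SP$ shows that an offer is accepted exactly when the signed prediction error exceeds $(1+\alpha)(1-\beta) - 1$, which is $0.8\%$ for $\alpha = 12\%$ and $\beta = 10\%$.

```python
import numpy as np

def accepted_mask(predicted, sale, alpha, beta):
    """True where the homeowner accepts, i.e. OP > (1 - beta) * SP."""
    offer = predicted / (1.0 + alpha)
    return offer > (1.0 - beta) * sale

alpha, beta = 0.12, 0.10

# Made-up predicted prices (PP) and sale prices (SP) for four houses
pp = np.array([330_000.0, 480_000.0, 610_000.0, 250_000.0])
sp = np.array([300_000.0, 500_000.0, 600_000.0, 250_000.0])

mask = accepted_mask(pp, sp, alpha, beta)
offer = pp / (1.0 + alpha)
profit = (sp - offer) / offer
errors = (pp - sp) / sp

print("accepted:", mask)
print("mean profit among accepted: {:.4f}".format(profit[mask].mean()))
print("bias among accepted: {:.4f}".format(errors[mask].mean()))
print("error threshold: {:.4f}".format((1.0 + alpha) * (1.0 - beta) - 1.0))
```

In this toy example only the overestimated houses are accepted, so the accepted subset has a positive bias and a mean profit below the target margin.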

**Question 3**: Based on the sale price in the valid data and the acceptance rule, what is the mean percentage profit among all accepted offers? Do you see a big difference compared to the targeted profit margin $\alpha$?  Include your answers in your report.  

In [None]:
# ---------- Question 3: compute mean percentage profit among accepted offers ----------


Let's plot the histogram of the prediction errors for those properties whose homeowners accepted the offer.

**Question 4**: What is the bias of the prediction errors when restricting to those properties whose owners accepted the offer? Based on the histogram and bias, can you explain your answers to Question 3?

In [None]:
# ---------- Question 4: bias of prediction errors among accepted offers ----------
