# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the unit's characteristics determine its market price?

In the lab folder, there are three options: housing prices in pierce_county_house_sales.csv, car prices in cars_hw.csv, and Airbnb rental prices in airbnb_hw.csv. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models does the best?
5. Make partial correlation plots for each of the numeric variables in your model. Do you notice any significant non-linearities?
6. Include transformations and interactions of your variables, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
7. Summarize your results from 1 to 6. Have you learned anything about overfitting and underfitting, or model selection?
8. If you have time, use the sklearn.linear_model.Lasso to regularize your model and select the most predictive features. Which does it select? What are the RMSE and $R^2$? We'll cover the Lasso later in detail in class.
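
For reference, the two metrics requested in steps 4, 6, and 8 are usually defined as follows, where $y_i$ is the observed price, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean observed price, and $n$ the number of observations in the set being scored (this notation is an assumption; the lab itself does not fix one):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is in the units of the price, so lower is better. $R^2$ is unitless, and on a test set it can be negative when the model predicts worse than the mean of that set.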

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

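# pd.read_csv already returns a DataFrame, so no extra conversion is needed
cars_df = pd.read_csv('https://raw.githubusercontent.com/ezraattisso/linearModels/refs/heads/main/lab/data/cars_hw.csv')

# Preview the first rows (printed so they show up even though this is not the last line of the cell)
print(cars_df.head())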

# Checking for missing values
print("Missing Values:\n", cars_df.isnull().sum())

# Checking data types
print("\nData Types:\n", cars_df.dtypes)


# Basic summary statistics
print("\nSummary Statistics:\n", cars_df.describe())

In [None]:
### Q1

# Exploratory plots: histograms, a pair plot, and box plots.

import matplotlib.pyplot as plt
import seaborn as sns

# histogram

cars_df.hist(figsize=(12, 8), bins=20, edgecolor='black')
plt.suptitle("Distribution of Numeric Features", fontsize=16)
plt.show()


# pairplot

sns.pairplot(cars_df)
plt.show()

# boxplot

plt.figure(figsize=(12, 6))
sns.boxplot(data=cars_df, orient='h')
plt.title("Boxplots of Numeric Features")
plt.show()



In [None]:
### Q2

# Checking which columns are categorical here, and then one-hot encoding them.

categorical_cols = cars_df.select_dtypes(include=['object']).columns
print("Categorical Columns:\n", categorical_cols)

cars_df_encoded = pd.get_dummies(cars_df, columns=categorical_cols, drop_first=True)

print(cars_df_encoded.head())
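
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from cars_hw.csv; `dtype=int` is only there so the dummies print as 0/1.

```python
import pandas as pd

# Toy data: the column names and values are made up for illustration.
toy = pd.DataFrame({
    "Fuel_Type": ["petrol", "diesel", "cng", "petrol"],
    "Price": [500, 650, 420, 510],
})

# drop_first=True drops the first category in sorted order ("cng"),
# which becomes the baseline that the other dummies are measured against.
toy_encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True, dtype=int)

print(toy_encoded)
```

Dropping one level per categorical variable avoids perfect collinearity between the dummies and the intercept in a linear regression.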


In [5]:
### Q3

from sklearn.model_selection import train_test_split

# Here I'm defining the features X (every column except the target) and my target variable y
X = cars_df_encoded.drop(columns=['Price'])
y = cars_df_encoded['Price']

# I'm splitting the data 80/20 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verifying the split
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

Training set size: 780 samples
Test set size: 196 samples
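
The sizes above follow from the row count: 780 + 196 = 976 rows, and 20% of 976 is 195.2, which `train_test_split` rounds up to 196 test rows. A small sketch on stand-in data (the arrays below are placeholders, not the car data) reproduces the sizes and shows that the same `random_state` on the same number of rows gives the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the output above (780 + 196 = 976).
n_rows = 976
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))

# The same seed on the same number of rows reproduces the same partition.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

This is why models fitted in later cells with `random_state=42` can be compared on the same test rows, as long as no rows are dropped in between.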


In [6]:
### Q4

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np


X = cars_df_encoded[['Mileage_Run']]  # Predictor variable
y = cars_df_encoded['Price']  # Target variable


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing and Fitting model
model = LinearRegression()
model.fit(X_train, y_train)

# these are my predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating RMSE here
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# calculating R^2 here
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)


print(f"Training RMSE: {rmse_train:.2f}, Training R^2: {r2_train:.2f}")
print(f"Test RMSE: {rmse_test:.2f}, Test R^2: {r2_test:.2f}")

Training RMSE: 369437.23, Training R^2: 0.02
Test RMSE: 330349.98, Test R^2: 0.05
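
Question 4 asks which of several simple models does best. One possible way to organize that comparison is sketched below on synthetic data. The helper `fit_and_score`, the column names, and the coefficients are all hypothetical and only illustrate the pattern of scoring several feature sets on one shared split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names and coefficients are made up.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "mileage": rng.uniform(0, 100, n),
    "age": rng.uniform(0, 10, n),
})
demo["price"] = 1000 - 2 * demo["mileage"] - 50 * demo["age"] + rng.normal(0, 20, n)

# Split once so every candidate model is scored on the same rows.
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=42)


def fit_and_score(features, target="price"):
    """Fit OLS on the training rows and return RMSE and R^2 for both sets."""
    model = LinearRegression().fit(train_df[features], train_df[target])
    scores = {"features": ", ".join(features)}
    for name, part in [("train", train_df), ("test", test_df)]:
        pred = model.predict(part[features])
        scores[f"rmse_{name}"] = np.sqrt(mean_squared_error(part[target], pred))
        scores[f"r2_{name}"] = r2_score(part[target], pred)
    return scores


# One row per candidate model.
comparison = pd.DataFrame([
    fit_and_score(["mileage"]),
    fit_and_score(["age"]),
    fit_and_score(["mileage", "age"]),
])
print(comparison.round(3))
```

On the training set, adding a predictor to an OLS model can never raise the RMSE, so the test columns are the ones to use when picking the best model.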


In [None]:
### Q5

import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd


numeric_cols = ['Mileage_Run', 'Year', 'Engine_Size']
# Remove the target variable from predictors if it's in the list
if 'Price' in numeric_cols:
    numeric_cols.remove('Price')

# Create a figure with subplots for each predictor
n_cols = min(2, len(numeric_cols))
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 4*n_rows))
if n_rows * n_cols > 1:
    axes = axes.flatten()
else:
    axes = [axes]

# Add a constant to the predictors for the intercept term
X_with_const = sm.add_constant(cars_df_encoded[numeric_cols])
model = sm.OLS(y, X_with_const)
results = model.fit()

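# Create partial regression plots for each predictor, controlling for the others
for i, col in enumerate(numeric_cols):
    others = [c for c in numeric_cols if c != col]
    # plot_partregress takes (endog, exog_i, exog_others); with data= these can be column names
    sm.graphics.plot_partregress('Price', col, others, data=cars_df_encoded,
                                 obs_labels=False, ax=axes[i])
    axes[i].set_title(f'Partial Regression Plot for {col}')
    axes[i].grid(True, linestyle='--', alpha=0.6)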

# Remove any unused subplots
for j in range(len(numeric_cols), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
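
As a sketch of what a partial regression plot shows: both axes are residuals after regressing on the other predictors, and the slope of the line through that scatter equals the variable's coefficient in the full regression (the Frisch-Waugh-Lovell result). The example below checks this with NumPy on synthetic data; the variables `x1`, `x2` and the helper `residualize` are hypothetical and unrelated to the car data.

```python
import numpy as np

# Synthetic stand-in data: two correlated predictors and a linear response.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y_demo = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)


def residualize(v, others):
    """Return the residuals of v after an OLS fit on `others` plus an intercept."""
    design = np.column_stack([np.ones(len(v)), others])
    beta, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ beta


# Axes of the partial regression plot for x1, controlling for x2.
y_resid = residualize(y_demo, x2)
x1_resid = residualize(x1, x2)

# Slope of the line through the residual scatter (both residuals have mean zero).
partial_slope = (x1_resid @ y_resid) / (x1_resid @ x1_resid)

# Coefficient on x1 in the full regression of y on [1, x1, x2].
full_design = np.column_stack([np.ones(n), x1, x2])
full_beta, *_ = np.linalg.lstsq(full_design, y_demo, rcond=None)

print(partial_slope, full_beta[1])
```

Curvature in the residual scatter, rather than a roughly straight band, is the visual cue for a non-linearity that a transformation might capture.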


In [None]:
### Q6

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt


numeric_features = ['Mileage_Run', 'Year']
X = cars_df_encoded[numeric_features]
y = cars_df_encoded['Price']

# Splitting the data into training and test sets (same seed and row count as Q4, so the rows match).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create a more complex model with transformations and interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

feature_names = poly.get_feature_names_out(numeric_features)
print("Transformed features:", feature_names)

# Fitting the model on the transformed data.
complex_model = LinearRegression()
complex_model.fit(X_train_poly, y_train)

# Making predictions and evaluating.
y_train_pred_complex = complex_model.predict(X_train_poly)
y_test_pred_complex = complex_model.predict(X_test_poly)

# Calculating some metrics here.
rmse_train_complex = np.sqrt(mean_squared_error(y_train, y_train_pred_complex))
rmse_test_complex = np.sqrt(mean_squared_error(y_test, y_test_pred_complex))
r2_train_complex = r2_score(y_train, y_train_pred_complex)
r2_test_complex = r2_score(y_test, y_test_pred_complex)

# Displaying the results.
print("\nComplex Model with Transformations and Interactions:")
print(f"Training RMSE: {rmse_train_complex:.2f}, Training R^2: {r2_train_complex:.2f}")
print(f"Test RMSE: {rmse_test_complex:.2f}, Test R^2: {r2_test_complex:.2f}")

# Comparing with the simpler model from Q4.
print("\nComparison with Simple Model:")
print(f"Simple - Training RMSE: {rmse_train:.2f}, Training R^2: {r2_train:.2f}")
print(f"Simple - Test RMSE: {rmse_test:.2f}, Test R^2: {r2_test:.2f}")
print(f"Complex - Training RMSE: {rmse_train_complex:.2f}, Training R^2: {r2_train_complex:.2f}")
print(f"Complex - Test RMSE: {rmse_test_complex:.2f}, Test R^2: {r2_test_complex:.2f}")

## 7.

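The simple linear model with mileage alone explains almost none of the variation in price: its $R^2$ is 0.02 on the training set and 0.05 on the test set, with an RMSE above 330,000 on both. That is a clear case of underfitting, since a single predictor cannot capture how the features of a car determine its value. The more complex model adds Year, the squared terms, and the Mileage_Run × Year interaction. Because it contains the simple model as a special case and is fit on the same training rows, its training RMSE can only be lower or equal and its training $R^2$ higher or equal. The comparison that matters for model selection is therefore the one on the test set: if the complex model does much better on the training data than on the test data, it is overfitting, meaning it has learned noise in the training sample rather than patterns that generalize.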
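
For the optional Lasso step (task 8), a minimal sketch on synthetic data is given below. It is not fitted to the car data: the features, coefficients, and `alpha=0.1` are all assumptions chosen for illustration, and in practice `alpha` would be tuned, for example with `LassoCV`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only the first two of five features affect y.
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize first so the L1 penalty treats every feature on the same scale.
# alpha=0.1 is an arbitrary choice for this sketch.
lasso_demo = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_demo.fit(X_demo, y_demo)

# Features with a nonzero coefficient are the ones the Lasso keeps.
coefs = lasso_demo.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)
print("Coefficients:", np.round(coefs, 3))
print("Selected feature indices:", selected)
```

The same pattern would apply to the encoded car data: fit on the training split, read off the nonzero coefficients, and report RMSE and $R^2$ on both splits.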