# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: how do a unit's characteristics determine its market price?
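
One common way to make this concrete is a linear hedonic regression. This is a sketch, not the specification the lab requires, and the notation is introduced here for illustration: $p_i$ is the price of unit $i$, $x_{ik}$ is its $k$-th characteristic, $\beta_k$ is the implicit price of that characteristic, and $\varepsilon_i$ collects whatever the characteristics do not explain.

$$
p_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i
$$

Prices are often right-skewed, so a frequently used alternative models the log of price, in which case each $\beta_k$ is read as an approximate proportional effect:

$$
\log p_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i
$$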

In the lab folder, there are three options: housing prices in `pierce_county_house_sales.csv`, car prices in `cars_hw.csv`, and Airbnb rental prices in `airbnb_hw.csv`. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models performs best?
5. Make partial correlation plots for each of the numeric variables in your model. Do you notice any significant non-linearities?
6. Include transformations and interactions of your variables, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
7. Summarize your results from 1 to 6. Have you learned anything about overfitting and underfitting, or model selection?
8. If you have time, use `sklearn.linear_model.Lasso` to regularize your model and select the most predictive features. Which features does it select? What are the RMSE and $R^2$? We'll cover the Lasso later in detail in class.
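
As a rough illustration of step 8, the sketch below fits a Lasso on synthetic data in which only the first three of ten features matter. The data, the variable names, and `alpha=0.3` are all assumptions made for the example, not values taken from the lab datasets. In practice you would tune `alpha`, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 candidate features,
# only the first 3 actually affect the target.
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(400, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_sim = X_sim @ true_coef + rng.normal(scale=1.0, size=400)

Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_sim, y_sim, test_size=0.2, random_state=42
)

# Standardize first so the L1 penalty treats every feature on the same scale.
# alpha=0.3 is an arbitrary illustrative choice, not a tuned value.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.3))
lasso.fit(Xs_tr, ys_tr)

# Features with a nonzero coefficient are the ones the Lasso "selects".
coefs = lasso.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)

pred_te = lasso.predict(Xs_te)
rmse_te = float(np.sqrt(mean_squared_error(ys_te, pred_te)))
r2_te = float(r2_score(ys_te, pred_te))

print("selected feature indices:", selected)
print(f"test RMSE = {rmse_te:.3f}, test R^2 = {r2_te:.3f}")
```

With one-hot encoded data, the same idea applies column by column: read the selected column names off `X.columns[selected]` instead of the integer indices.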

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('cars_hw.csv')

In [None]:
# Display basic information about the dataset
df.info()  # info() prints its summary directly and returns None
print(df.describe())


# Check for missing values
print(df.isnull().sum())

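# Handle missing values (example: fill with mean)
# For demonstration, let's fill missing 'price' with the mean price
if 'price' in df.columns and df['price'].isnull().any():
  # Assign the result back instead of calling fillna(inplace=True) on a
  # column selection, which is deprecated in recent pandas and may not update df
  df['price'] = df['price'].fillna(df['price'].mean())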


# Visualizations (examples)
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], kde=True)
plt.title('Distribution of Car Prices')
plt.show()


plt.figure(figsize=(10,6))
sns.boxplot(x='year', y='price', data=df)
plt.title('Price vs. Year')
plt.show()


# Further EDA and visualizations (e.g., scatter plots, pair plots, etc.) to understand the relationships between features


# 2. Variable Transformation (example: one-hot encoding for categorical variables)


# Example: One-hot encoding for 'model' if it's a categorical feature.
#  Replace with your actual categorical variables as needed
if 'model' in df.columns:
  df = pd.get_dummies(df, columns=['model'], drop_first=True)


# 3. Train-test split
from sklearn.model_selection import train_test_split

# ... [Your code to prepare features (X) and target variable (y)]
# Example:
X = df.drop('price', axis=1)  # Replace 'price' with your target variable name
y = df['price']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# ... [Rest of your code (steps 4-8)]
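
One possible shape for step 4 is a small helper that fits a model on the training set and reports RMSE and $R^2$ on both sets. The sketch below runs on a synthetic stand-in rather than the lab data: the column names `age` and `mileage`, the helper `evaluate`, and the simulated price formula are all hypothetical and chosen only so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned data set; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.uniform(0, 15, n),
    "mileage": rng.uniform(5, 150, n),
})
toy["price"] = 30 - 1.2 * toy["age"] - 0.05 * toy["mileage"] + rng.normal(0, 2, n)

X_toy = toy[["age", "mileage"]]
y_toy = toy["price"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit on the training set only; return RMSE and R^2 on both sets."""
    model.fit(X_tr, y_tr)
    out = {}
    for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(X_part)
        out[f"rmse_{name}"] = float(np.sqrt(mean_squared_error(y_part, pred)))
        out[f"r2_{name}"] = float(r2_score(y_part, pred))
    return out

# Compare two simple specifications side by side (one row per model).
results = pd.DataFrame({
    "age only": evaluate(LinearRegression(), X_tr[["age"]], y_tr, X_te[["age"]], y_te),
    "age + mileage": evaluate(LinearRegression(), X_tr, y_tr, X_te, y_te),
}).T
print(results.round(3))
```

A large gap between the training and test rows for the same model is the usual sign of overfitting. The same helper can be reused for the richer model in step 6 by passing a different feature matrix.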
