# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the unit's characteristics determine its market price?

In the lab folder, there are three options: housing prices in pierce_county_house_sales.csv, car prices in cars_hw.csv, and Airbnb rental prices in airbnb_hw.csv. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models does the best?
5. Include transformations and interactions, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
6. Summarize your results from 1 to 5. Have you learned anything about overfitting and underfitting, or model selection?
7. If you have time, use the sklearn.linear_model.Lasso to regularize your model and select the most predictive features. Which does it select? What are the RMSE and $R^2$? We'll cover the Lasso later in detail in class.
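
For reference, the two metrics requested in steps 4, 5 and 7 can be computed by hand. The sketch below is only an illustration: the helper names `rmse` and `r_squared` and the sample prices are made up, and in practice `sklearn.metrics` provides equivalent functions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """R^2: one minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical prices and predictions, for illustration only
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

print(rmse(y_true, y_pred))       # every prediction is off by 10, so RMSE is 10.0
print(r_squared(y_true, y_pred))
```

RMSE is expressed in the units of the target, so it changes if the target is rescaled, while $R^2$ is unit-free.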



In [6]:
! git clone https://github.com/emilymacris/labs.git


Cloning into 'labs'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 103 (delta 39), reused 41 (delta 22), pack-reused 32 (from 1)
Receiving objects: 100% (103/103), 20.87 MiB | 25.08 MiB/s, done.
Resolving deltas: 100% (41/41), done.


In [1]:
import pandas as pd
import numpy as np




In [7]:
airbnb_df = pd.read_csv("./labs/04_hedonic_pricing/airbnb_hw.csv")


In [10]:
# Normalize column names: lowercase, with spaces replaced by underscores
airbnb_df.columns = airbnb_df.columns.str.lower().str.replace(' ', '_')

# Parse host_since as a date; unparseable values become NaT
airbnb_df['host_since'] = pd.to_datetime(airbnb_df['host_since'], errors='coerce')

# Fill missing property types with an explicit 'Unknown' category
airbnb_df['property_type'] = airbnb_df['property_type'].fillna('Unknown')

# Flag missing zip codes with a sentinel value of -1
airbnb_df['zipcode'] = airbnb_df['zipcode'].fillna(-1)

# Impute missing bed counts with the median
airbnb_df['beds'] = airbnb_df['beds'].fillna(airbnb_df['beds'].median())

# Drop listings that have no review score
airbnb_df.dropna(subset=['review_scores_rating', 'review_scores_rating_(bin)'], inplace=True)

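# Strip dollar signs and commas, then cast price to float
# (raw string avoids an invalid-escape warning for '\$')
airbnb_df['price'] = airbnb_df['price'].replace(r'[\$,]', '', regex=True).astype(float)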

airbnb_df.info(), airbnb_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 22155 entries, 4 to 30409
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   host_id                     22155 non-null  int64         
 1   host_since                  22155 non-null  datetime64[ns]
 2   name                        22155 non-null  object        
 3   neighbourhood_              22155 non-null  object        
 4   property_type               22155 non-null  object        
 5   review_scores_rating_(bin)  22155 non-null  float64       
 6   room_type                   22155 non-null  object        
 7   zipcode                     22155 non-null  float64       
 8   beds                        22155 non-null  float64       
 9   number_of_records           22155 non-null  int64         
 10  number_of_reviews           22155 non-null  int64         
 11  price                       22155 non-null  float64       


(None,
    host_id host_since                                 name neighbourhood_  \
 4      500 2008-06-26             Trendy Times Square Loft      Manhattan   
 5     1039 2008-07-25   Big Greenpoint 1BD w/ Skyline View       Brooklyn   
 6     1783 2008-08-12                         Amazing Also      Manhattan   
 7     2078 2008-08-15  Colorful, quiet, & near the subway!       Brooklyn   
 8     2339 2008-08-20  East Village Cocoon: 2 Bedroom Flat      Manhattan   
 
   property_type  review_scores_rating_(bin)        room_type  zipcode  beds  \
 4     Apartment                        95.0     Private room  10036.0   3.0   
 5     Apartment                       100.0  Entire home/apt  11222.0   1.0   
 6     Apartment                       100.0  Entire home/apt  10004.0   1.0   
 7     Apartment                        90.0     Private room  11201.0   1.0   
 8     Apartment                        90.0  Entire home/apt  10009.0   2.0   
 
    number_of_records  number_of_reviews 
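
The cleaning cell above strips dollar signs and commas from price before casting it to float. A minimal sketch of that one step on made-up strings (the sample values are hypothetical, not taken from airbnb_hw.csv):

```python
import pandas as pd

# Hypothetical raw price strings, for illustration only
raw_prices = pd.Series(["$1,250", "99", "$75.50"])

# Same pattern as the cleaning cell: remove '$' and ',' and then cast to float
clean_prices = raw_prices.replace(r"[\$,]", "", regex=True).astype(float)

print(clean_prices.tolist())  # [1250.0, 99.0, 75.5]
```

Without the comma removal, a value such as "1,250" would fail the float conversion.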

In [11]:
from sklearn.preprocessing import StandardScaler

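# One-hot encode the categorical columns; drop_first=True drops one level per
# column to avoid perfect collinearity among the dummies
airbnb_encoded = pd.get_dummies(airbnb_df, columns=['property_type', 'room_type', 'neighbourhood_'], drop_first=True)

# Standardize the continuous columns to zero mean and unit variance.
# Note: 'price' (the target) is scaled too, so every RMSE reported below is in
# standard deviations of price, not dollars. The scaler is also fit on the full
# dataset before the train-test split, so test rows influence the scaling.
scaler = StandardScaler()
continuous_columns = ['price', 'beds', 'review_scores_rating', 'review_scores_rating_(bin)', 'number_of_reviews', 'number_of_records']
airbnb_encoded[continuous_columns] = scaler.fit_transform(airbnb_encoded[continuous_columns])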


airbnb_encoded.info(), airbnb_encoded.head()

<class 'pandas.core.frame.DataFrame'>
Index: 22155 entries, 4 to 30409
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   host_id                        22155 non-null  int64         
 1   host_since                     22155 non-null  datetime64[ns]
 2   name                           22155 non-null  object        
 3   review_scores_rating_(bin)     22155 non-null  float64       
 4   zipcode                        22155 non-null  float64       
 5   beds                           22155 non-null  float64       
 6   number_of_records              22155 non-null  float64       
 7   number_of_reviews              22155 non-null  float64       
 8   price                          22155 non-null  float64       
 9   review_scores_rating           22155 non-null  float64       
 10  property_type_Bed & Breakfast  22155 non-null  bool          
 11  property_type_Boat  

(None,
    host_id host_since                                 name  \
 4      500 2008-06-26             Trendy Times Square Loft   
 5     1039 2008-07-25   Big Greenpoint 1BD w/ Skyline View   
 6     1783 2008-08-12                         Amazing Also   
 7     2078 2008-08-15  Colorful, quiet, & near the subway!   
 8     2339 2008-08-20  East Village Cocoon: 2 Bedroom Flat   
 
    review_scores_rating_(bin)  zipcode      beds  number_of_records  \
 4                    0.470382  10036.0  1.383193                0.0   
 5                    1.022300  11222.0 -0.533894                0.0   
 6                    1.022300  10004.0 -0.533894                0.0   
 7                   -0.081536  11201.0 -0.533894                0.0   
 8                   -0.081536  10009.0  0.424649                0.0   
 
    number_of_reviews     price  review_scores_rating  ...  \
 4           0.925409  2.648685              0.452734  ...   
 5          -0.514464 -0.038887              0.904702  

In [12]:
from sklearn.model_selection import train_test_split

X = airbnb_encoded.drop(columns=['price'])
y = airbnb_encoded['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((17724, 33), (4431, 33), (17724,), (4431,))
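
In the cells above, the scaler is fit on the full dataset before the split. A common alternative is to split first and fit the scaler on the training rows only, so the test set plays no part in preprocessing. The sketch below shows that ordering on synthetic data; the `_demo` names and the data are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=200)

# Split first, so the test rows stay untouched
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # training columns are centered at zero
print(X_te_scaled.mean(axis=0).round(3))  # test columns are close to zero, not exactly zero
```

Leaving the target unscaled in this setup would also keep RMSE in the original price units.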

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


feature_columns = ['beds', 'review_scores_rating', 'number_of_reviews', 'review_scores_rating_(bin)']


results = {}

# Loop over each feature, fit a model, and evaluate performance
for feature in feature_columns:
    # Double brackets keep the single feature as a 2D, one-column DataFrame
    X_train_feature = X_train[[feature]]
    X_test_feature = X_test[[feature]]

    # Initialize and fit the model
    model = LinearRegression()
    model.fit(X_train_feature, y_train)

    # Predict on training and test data
    y_train_pred = model.predict(X_train_feature)
    y_test_pred = model.predict(X_test_feature)

    # Calculate RMSE and R^2 for both training and test sets
    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    r2_train = r2_score(y_train, y_train_pred)
    rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
    r2_test = r2_score(y_test, y_test_pred)

    #Store results for this model
    results[feature] = {
        'RMSE_train': rmse_train,
        'R2_train': r2_train,
        'RMSE_test': rmse_test,
        'R2_test': r2_test
    }

results

{'beds': {'RMSE_train': 0.9768387601151115,
  'R2_train': 0.12845612289579544,
  'RMSE_test': 0.7117091656687948,
  'R2_test': 0.1837183954914825},
 'review_scores_rating': {'RMSE_train': 1.0445596006100517,
  'R2_train': 0.003425113450262307,
  'RMSE_test': 0.7856034619682374,
  'R2_test': 0.005415540721233625},
 'number_of_reviews': {'RMSE_train': 1.0459720821580534,
  'R2_train': 0.000728100538028742,
  'RMSE_test': 0.7875091247029067,
  'R2_test': 0.000584499545755901},
 'review_scores_rating_(bin)': {'RMSE_train': 1.0446168001564435,
  'R2_train': 0.0033159666056636894,
  'RMSE_test': 0.7856678312195158,
  'R2_test': 0.005252549383041982}}
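
The nested dictionary is easier to compare as a table. A small sketch, using the values printed above rounded to four decimals (the names `results_demo` and `summary` are introduced here for illustration):

```python
import pandas as pd

# Metrics copied from the output above, rounded to four decimals
results_demo = {
    'beds': {'RMSE_train': 0.9768, 'R2_train': 0.1285,
             'RMSE_test': 0.7117, 'R2_test': 0.1837},
    'review_scores_rating': {'RMSE_train': 1.0446, 'R2_train': 0.0034,
                             'RMSE_test': 0.7856, 'R2_test': 0.0054},
    'number_of_reviews': {'RMSE_train': 1.0460, 'R2_train': 0.0007,
                          'RMSE_test': 0.7875, 'R2_test': 0.0006},
    'review_scores_rating_(bin)': {'RMSE_train': 1.0446, 'R2_train': 0.0033,
                                   'RMSE_test': 0.7857, 'R2_test': 0.0053},
}

# One row per model, best test R^2 first
summary = pd.DataFrame(results_demo).T.sort_values('R2_test', ascending=False)
print(summary)
```

Sorted this way, the beds model comes first and number_of_reviews last.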

In [14]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

#Select features for transformations and interactions
selected_features = ['beds', 'review_scores_rating', 'number_of_reviews']

# Pipeline: degree-2 polynomial expansion (squares and pairwise interactions), then linear regression
pipeline = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear_regression', LinearRegression())
])

#Fit the model on the selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

pipeline.fit(X_train_selected, y_train)

#Predict on training and test data
y_train_pred_complex = pipeline.predict(X_train_selected)
y_test_pred_complex = pipeline.predict(X_test_selected)

#Calculate RMSE and R^2 for both training and test sets
rmse_train_complex = np.sqrt(mean_squared_error(y_train, y_train_pred_complex))
r2_train_complex = r2_score(y_train, y_train_pred_complex)
rmse_test_complex = np.sqrt(mean_squared_error(y_test, y_test_pred_complex))
r2_test_complex = r2_score(y_test, y_test_pred_complex)

#Display complex model performance
complex_model_performance = {
    'RMSE_train': rmse_train_complex,
    'R2_train': r2_train_complex,
    'RMSE_test': rmse_test_complex,
    'R2_test': r2_test_complex
}

complex_model_performance

{'RMSE_train': 0.9688558795296899,
 'R2_train': 0.14264270591788453,
 'RMSE_test': 0.7031505924011576,
 'R2_test': 0.20323254639249289}

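We started with simple models that each used a single feature. All of them underfit: the best one, based on beds, reached an R² of only about 0.13 on the training set and 0.18 on the test set, and each review-based feature explained less than 1% of the variation in price. Because price was standardized before modeling, the RMSE values are in standard deviations of price, not dollars.

We then built a more complex model from beds, review_scores_rating, and number_of_reviews, adding their squares and pairwise interactions. It improved only slightly on the beds model: R² rose to about 0.14 on the training set and 0.20 on the test set, and test RMSE fell from about 0.71 to 0.70.

None of these results shows overfitting. Test RMSE is lower than training RMSE for every model, so the added complexity did not cause the model to memorize the training data. The remaining problem is underfitting: even the complex model leaves roughly 80% of the variation in price unexplained, and the property type, room type, and neighbourhood dummies created earlier were never used in any model.

This taught me that picking a good model means balancing simplicity against complexity, and that the balance has to be judged on held-out data rather than on training fit alone.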
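
Step 7 of the assignment (the Lasso) was not attempted above. The sketch below shows how `sklearn.linear_model.Lasso` could be used for feature selection, on synthetic data where only the first two of six columns affect the target. The data, the `alpha=0.1` penalty, and all variable names are assumptions for illustration; on the Airbnb data the penalty would need tuning, and the results would differ.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only columns 0 and 1 drive the target (illustration only)
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 6))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale the features inside the pipeline so the L1 penalty treats them comparably
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_model.fit(X_tr, y_tr)

# Features whose coefficients were not shrunk to exactly zero
coefs = lasso_model.named_steps['lasso'].coef_
selected = np.flatnonzero(coefs)

y_te_pred = lasso_model.predict(X_te)
rmse_te = np.sqrt(mean_squared_error(y_te, y_te_pred))
r2_te = r2_score(y_te, y_te_pred)

print("selected columns:", selected.tolist())
print("test RMSE:", rmse_te, "test R2:", r2_te)
```

A larger `alpha` zeroes out more coefficients; a value near zero approaches ordinary least squares.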

