# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the asset's characteristics determine its market price?

In the lab folder, there are three options: housing prices in pierce_county_house_sales.csv, car prices in cars_hw.csv, and Airbnb rental prices in airbnb_hw.csv. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models does the best?
5. Include transformations and interactions, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
6. Summarize your results from 1 to 5. Have you learned anything about overfitting and underfitting, or model selection?
7. If you have time, use sklearn.linear_model.Lasso to regularize your model and select the most predictive features. Which features does it select? What are the RMSE and $R^2$? We'll cover the Lasso later in detail in class.
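
As a rough sketch of what step 7 might look like, the snippet below fits `sklearn.linear_model.Lasso` on synthetic data rather than on any of the lab datasets. The data, the penalty `alpha=0.1`, and the choice to standardize the features first are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cleaned feature matrix (an assumption for
# illustration): 5 features, of which only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

# Standardize with training-set statistics only, so the L1 penalty
# treats every feature on the same scale.
scaler = StandardScaler().fit(X_train)
lasso = Lasso(alpha=0.1).fit(scaler.transform(X_train), y_train)

# Features whose coefficient was not shrunk to exactly zero
selected = np.flatnonzero(lasso.coef_)

# Test-set RMSE and R2, computed the same way as for the linear models
y_hat = lasso.predict(scaler.transform(X_test))
rmse = np.sqrt(np.mean((y_test - y_hat) ** 2))
r2 = lasso.score(scaler.transform(X_test), y_test)

print('Selected feature indices: ', selected)
print('Rsq: ', r2)
print('RMSE: ', rmse)
```

On real data the penalty would normally be tuned rather than fixed, for example by trying several values of `alpha` and comparing test-set RMSE.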



In [24]:
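import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Load the Airbnb listings
df = pd.read_csv('./airbnb_hw.csv', low_memory=False)
print( df.shape, '\n')

# Price is read as text; strip commas (thousands separators) and convert to numeric
df['Price'] = df['Price'].str.replace(',','')
df['Price'] = pd.to_numeric(df['Price'],errors='coerce')
print('Total missing: ', df['Price'].isnull().sum())

# One-hot encode neighbourhood, dropping the first category as the baseline.
# Note that the column name in the CSV has a trailing space.
X = pd.get_dummies(df['Neighbourhood '], dtype='int', drop_first = True)

# 80/20 train-test split
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)

# Simple linear model: price on neighbourhood dummies only
reg = linear_model.LinearRegression(fit_intercept = True).fit(X_train, y_train)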

results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
print(results)

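# Test-set performance
y_hat = reg.predict(X_test)
print('Rsq: ', reg.score(X_test,y_test)) # R2
rmse = np.sqrt( np.mean( (y_test - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE

# Training-set performance
y_hat = reg.predict(X_train)
print('Rsq: ', reg.score(X_train,y_train)) # R2
rmse = np.sqrt( np.mean( (y_train - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE

# The model performs about equally poorly on both sets: R2 is roughly 0.04 on the training set and 0.03 on the test set,
# so neighbourhood alone explains very little of the variation in price.
# The training RMSE (about 185) is lower than the test RMSE (about 227). Because the two R2 values are close,
# most of that gap reflects more dispersed prices in the test set rather than overfitting.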



(30478, 13) 

Total missing:  0
        variable  coefficient
0       Brooklyn    32.936408
1      Manhattan    99.813274
2         Queens     4.282124
3  Staten Island    77.188159
Rsq:  0.030673261290627973
RMSE:  226.8281340018276
Rsq:  0.037963986587503995
RMSE:  185.12846105729506
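
One way to read the coefficients above: a regression on an intercept plus dummies for a single categorical variable simply reproduces the group means of the training data. The intercept is the mean of the dropped baseline category, and each coefficient is that category's mean minus the baseline mean. The sketch below checks this on made-up data; the `toy` frame, its group labels, and its prices are hypothetical and unrelated to the Airbnb file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: three groups with noisy prices
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'group': rng.choice(['A', 'B', 'C'], size=300),
    'price': rng.gamma(shape=2.0, scale=50.0, size=300),
})

# Same encoding as in the cell above: one dummy per category, first one dropped
X_toy = pd.get_dummies(toy['group'], dtype='int', drop_first=True)
ols = LinearRegression(fit_intercept=True).fit(X_toy, toy['price'])

# Plain group averages, for comparison
group_means = toy.groupby('group')['price'].mean()

# Intercept = mean of the baseline group 'A';
# intercept + coefficient = mean of each remaining group
implied_means = pd.Series(
    ols.intercept_ + np.concatenate([[0.0], ols.coef_]),
    index=['A', 'B', 'C'],
)

print(pd.DataFrame({'group mean': group_means, 'OLS implied': implied_means}))
```

Read this way, the Manhattan coefficient of about 99.8 says that Manhattan listings in the training set average roughly $100 more than listings in the omitted baseline neighbourhood.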


In [27]:
# Same model, but with log(Price) as the target (requires strictly positive prices)
y = np.log(df['Price'])
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=100)

reg = linear_model.LinearRegression(fit_intercept = True).fit(X_train, y_train)

results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
print(results)

# Test-set performance
y_hat = reg.predict(X_test)
print('Rsq: ', reg.score(X_test,y_test)) # R2
rmse = np.sqrt( np.mean( (y_test - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE, in log units

# Training-set performance
y_hat = reg.predict(X_train)
print('Rsq: ', reg.score(X_train,y_train)) # R2
rmse = np.sqrt( np.mean( (y_train - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE, in log units

# For this model I used the log of price as the target. Performance is again similar on the training and test sets
# (R2 of about 0.136 vs. 0.133, RMSE of about 0.58 vs. 0.59).
# The RMSE is now measured in log units, so it looks much smaller but is not directly comparable to the dollar RMSE of the previous model.
# Overall, I did not try other train/test split ratios, but even with a fixed split, different regression specifications give noticeably different results.
# Some results did not look the way I expected, for example the log coefficients, R2, and RMSE, but there is still information to be gleaned from each one.

        variable  coefficient
0       Brooklyn     0.372101
1      Manhattan     0.771994
2         Queens     0.158732
3  Staten Island     0.233826
Rsq:  0.13302680647143117
RMSE:  0.5922451412721083
Rsq:  0.13586951944201642
RMSE:  0.5815317317008505
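
The log-price coefficients are easier to read after converting them to percentage differences. In a log-linear model, a dummy coefficient $b$ implies prices about $e^{b}$ times those of the omitted baseline category (in terms of the geometric mean), that is, a difference of $100\,(e^{b}-1)$ percent. The sketch below applies this to the coefficients printed above; the dictionary is just a hand-copied version of that output, and the name `pct_premium` is made up for illustration.

```python
import numpy as np

# Coefficients copied by hand from the log-price regression output above
log_coefs = {
    'Brooklyn': 0.372101,
    'Manhattan': 0.771994,
    'Queens': 0.158732,
    'Staten Island': 0.233826,
}

# Convert each log-point coefficient b into a percentage difference
# relative to the omitted baseline neighbourhood: 100 * (exp(b) - 1)
pct_premium = {name: 100 * (np.exp(b) - 1) for name, b in log_coefs.items()}

for name, pct in pct_premium.items():
    print(f'{name}: {pct:.1f}% relative to the baseline')
```

For small coefficients the log-point value and the percentage are close, but for large ones such as Manhattan's they diverge, so the conversion matters when reporting results.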
