# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the unit's characteristics determine its market price?

In the lab folder, there are three options: housing prices in `pierce_county_house_sales.csv`, car prices in `cars_hw.csv`, and Airbnb rental prices in `airbnb_hw.csv`. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on both the training and test sets. Which of your models does the best?
5. Include transformations and interactions, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model you build compare to the simpler ones?
6. Summarize your results from 1 to 5. Have you learned anything about overfitting and underfitting, or model selection?
7. If you have time, use `sklearn.linear_model.Lasso` to regularize your model and select the most predictive features. Which features does it select? What are the RMSE and $R^2$? We'll cover the Lasso in detail later in class.
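
For step 7, here is a minimal sketch of how `sklearn.linear_model.Lasso` can select features. It uses synthetic data rather than any of the lab datasets; the feature names `x0`...`x5`, the penalty `alpha=0.1`, and the data-generating process are all assumptions made for illustration. In practice `alpha` would be tuned, for example by cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: only x0 and x1 actually drive the outcome
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# The Lasso penalty depends on feature scale, so standardize using training data only
scaler = StandardScaler().fit(X_train)
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)

lasso = Lasso(alpha=0.1).fit(Z_train, y_train)

# Features with a nonzero coefficient are the ones the Lasso keeps
coefs = pd.Series(lasso.coef_, index=X.columns)
selected = list(coefs[coefs != 0].index)

rmse_test = np.sqrt(np.mean((y_test - lasso.predict(Z_test)) ** 2))
r2_test = lasso.score(Z_test, y_test)
print("Selected features:", selected)
print("Test RMSE:", rmse_test, "Test R^2:", r2_test)
```

A larger `alpha` zeroes out more coefficients, so the set of selected features depends on the penalty chosen.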



In [None]:
#Q1 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

cars1 = pd.read_csv('/Users/borayadiul/Desktop/labs/04_hedonic_pricing/cars_hw.csv')
cars1.describe()
#print(cars1.dtypes, '\n')


# Convert the ordinal labels ('1st', '2nd', '3rd') to numbers
cars1['No_of_Owners'] = cars1['No_of_Owners'].replace({'1st': '1', '2nd': '2', '3rd': '3'})
cars1['No_of_Owners'] = pd.to_numeric(cars1['No_of_Owners'])
#cars1.head()
print(cars1.dtypes, '\n')
cars1 = cars1.rename(columns = {'Seating_Capacity': "Seats", "No_of_Owners": "Num_Owners", "Mileage_Run" : "Mileage"})
cars1.head()

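# .copy() makes cars2 an independent frame, so new columns can be added to it later without warnings
cars2 = cars1.loc[:,['Make', "Make_Year", "Color", "Body_Type", "Mileage", "Num_Owners", "Seats", "Fuel_Type", "Transmission","Transmission_Type", "Price"]].copy()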
cars2.head()

print(cars2.isnull().sum().sum()) # total number of missing values across all columns

cars2.describe()

sns.kdeplot(data=cars2, x='Seats')


In [None]:
sns.kdeplot(data=cars2, x='Num_Owners')


In [None]:
sns.kdeplot(data=cars2, x='Make_Year')

In [None]:
sns.kdeplot(data=cars2, x='Mileage')

In [None]:
sns.kdeplot(data=cars2, x='Price')

In [None]:
sns.kdeplot(data=cars2, x='Make_Year', hue = "Make")

In [None]:
cars2.groupby('Make')['Price'].describe()

In [None]:
sns.scatterplot(x=cars2['Mileage'],y=cars2['Price'])

In [None]:
sns.boxplot(data=cars2)

In [145]:
# Q2 

make_dum = pd.get_dummies(cars2['Make'])
color_dum = pd.get_dummies(cars2['Color'])
body_dum = pd.get_dummies(cars2['Body_Type'])
fuel_dum = pd.get_dummies(cars2['Fuel_Type'])
transmis_dum = pd.get_dummies(cars2['Transmission'])
transmisType_dum = pd.get_dummies(cars2['Transmission_Type'])

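# Inverse hyperbolic sine (arcsinh): close to log(2x) for large x and defined at 0.
# Note that the column names use the suffix '_arcsin'.
cars2['Mileage_arcsin'] = np.arcsinh(cars2['Mileage'])
cars2['Price_arcsin'] = np.arcsinh(cars2['Price'])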
cars2.head()


X = pd.concat([make_dum,color_dum,body_dum,fuel_dum,transmis_dum,transmisType_dum,cars2],axis=1)
X.head()
#list(X.columns)





Unnamed: 0,Chevrolet,Datsun,Ford,Honda,Hyundai,Jeep,Kia,MG Motors,Mahindra,Maruti Suzuki,...,Body_Type,Mileage,Num_Owners,Seats,Fuel_Type,Transmission,Transmission_Type,Price,Mileage_arcsin,Price_arcsin
0,False,False,False,False,False,False,False,False,False,False,...,sedan,44611,1,5,diesel,7-Speed,Automatic,657000,11.398883,14.088586
1,False,False,False,False,True,False,False,False,False,False,...,crossover,20305,1,5,petrol,5-Speed,Manual,682000,10.61177,14.125932
2,False,False,False,True,False,False,False,False,False,False,...,suv,29540,2,5,petrol,5-Speed,Manual,793000,10.986648,14.276726
3,False,False,False,False,False,False,False,False,False,False,...,hatchback,35680,1,5,petrol,5-Speed,Manual,414000,11.175493,13.626768
4,False,False,False,False,True,False,False,False,False,False,...,hatchback,25126,1,5,petrol,5-Speed,Manual,515000,10.824806,13.845069
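
The concatenated frame above still carries the original string columns (`Make`, `Color`, and so on) and the target `Price`, none of which a scikit-learn regressor can take as features. One way to build a purely numeric design matrix is sketched below on a tiny made-up frame; the rows are hypothetical and not a sample of `cars_hw.csv`.

```python
import pandas as pd

# Hypothetical mini-sample with the same kinds of columns as cars2
df = pd.DataFrame({
    "Make": ["Honda", "Hyundai", "Honda", "Ford"],
    "Body_Type": ["suv", "hatchback", "sedan", "suv"],
    "Mileage": [29540, 25126, 41000, 52000],
    "Price": [793000, 515000, 640000, 700000],
})

y = df["Price"]

# columns=... encodes the listed categoricals and removes the original string
# columns in one step; drop_first=True omits one level per variable so the
# dummies are not collinear with an intercept
X = pd.get_dummies(
    df.drop(columns="Price"),
    columns=["Make", "Body_Type"],
    drop_first=True,
    dtype=float,
)

print(X.dtypes)
```

If the model is fit with `fit_intercept=False` instead, keeping every level (`drop_first=False`) for one categorical variable plays the role of the intercept.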


In [None]:
sns.scatterplot(data=cars2,x='Mileage_arcsin', y = 'Price_arcsin')
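
The columns `Mileage_arcsin` and `Price_arcsin` hold the inverse hyperbolic sine, $\operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right)$, which for large $x$ is close to $\ln(2x)$ and, unlike the log, is defined at zero. A quick check using the mileage and price values from the first rows of the table above:

```python
import numpy as np

# Values taken from the first three rows of the table above
mileage = np.array([44611.0, 20305.0, 29540.0])
price = np.array([657000.0, 682000.0, 793000.0])

for x in (mileage, price):
    # For large x, arcsinh(x) and log(2x) agree to many decimal places
    print(np.arcsinh(x), np.log(2 * x))

# arcsinh is defined at zero, where the log is not
print(np.arcsinh(0.0))
```

Because of this, coefficients in an arcsinh-arcsinh regression can be read approximately like those in a log-log regression when the values are large.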




In [186]:
cars2.head()

Unnamed: 0,Make,Make_Year,Color,Body_Type,Mileage,Num_Owners,Seats,Fuel_Type,Transmission,Transmission_Type,Price,Mileage_arcsin,Price_arcsin
0,Volkswagen,2017,silver,sedan,44611,1,5,diesel,7-Speed,Automatic,657000,11.398883,14.088586
1,Hyundai,2016,red,crossover,20305,1,5,petrol,5-Speed,Manual,682000,10.61177,14.125932
2,Honda,2019,white,suv,29540,2,5,petrol,5-Speed,Manual,793000,10.986648,14.276726
3,Renault,2017,bronze,hatchback,35680,1,5,petrol,5-Speed,Manual,414000,11.175493,13.626768
4,Hyundai,2017,orange,hatchback,25126,1,5,petrol,5-Speed,Manual,515000,10.824806,13.845069


In [None]:
#Q3
from sklearn import linear_model
from sklearn.model_selection import train_test_split

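# X is the combined frame built in Q2 (dummies plus the original columns); Price is the target
Y = cars2['Price']
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=.2,random_state=100)

L_train = len(y_train)
L_test = len(y_test)

# Double brackets return a one-column DataFrame, the 2-D shape scikit-learn expects
Z_train = X_train[['Mileage']]
Z_test = X_test[['Mileage']]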

#reg4 = linear_model.LinearRegression(fit_intercept=False).fit(Z_train,y_train)

In [190]:
# Randomize the rows in the dataframe:
N = cars2.shape[0]
cars2 = cars2.sample(frac=1, random_state=100) # randomize the order in which data appears
train_size = int(.8*N)

# How to do the split as needed: keep every column in each piece,
# and pull the target out separately so the covariates stay available
cars2_train = cars2.iloc[:train_size]
cars2_test = cars2.iloc[train_size:]
y_train_n = cars2_train['Price'] # separate names, so the sklearn split above is not overwritten
y_test_n = cars2_test['Price']

In [203]:
var_n = ['Mileage'] # Select variables (a list of column names, not the column itself)
X_train_n = cars2_train.loc[:,var_n] # Process training covariates
reg_n = linear_model.LinearRegression().fit(X_train_n,y_train_n) # Run regression
X_test_n = cars2_test.loc[:,var_n] # Process test covariates
y_hat = reg_n.predict(X_test_n)
print('Numeric only Rsq: ', reg_n.score(X_test_n,y_test_n)) # R2
rmse_n = np.sqrt( np.mean( (y_test_n - y_hat)**2 ))
print('Numeric only RMSE: ', rmse_n) # RMSE

In [195]:
var_n = ['Mileage']
print(var_n)

['Mileage']
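
Steps 3 and 4 of the assignment amount to splitting once, fitting on the training rows only, and scoring on both pieces. A minimal sketch on simulated data follows; the column names mirror `cars2`, but the numbers are made up and the helper `rmse_r2` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(100)
n = 1000
# Simulated data: price falls with mileage, plus noise
df = pd.DataFrame({"Mileage": rng.uniform(5_000, 100_000, size=n)})
df["Price"] = 900_000 - 4.0 * df["Mileage"] + rng.normal(scale=50_000, size=n)

# Double brackets keep X two-dimensional, as scikit-learn requires
X = df[["Mileage"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

reg = LinearRegression().fit(X_train, y_train)  # fit on training rows only

def rmse_r2(model, X, y):
    """Return (RMSE, R^2) of a fitted model on the given data."""
    resid = y - model.predict(X)
    return np.sqrt(np.mean(resid ** 2)), model.score(X, y)

rmse_train, r2_train = rmse_r2(reg, X_train, y_train)
rmse_test, r2_test = rmse_r2(reg, X_test, y_test)
print("Train RMSE:", rmse_train, "R^2:", r2_train)
print("Test  RMSE:", rmse_test, "R^2:", r2_test)
```

A test RMSE far above the training RMSE is the usual symptom of overfitting; similar values on both sets suggest the model generalizes about as well as it fits.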


In [130]:
#Q4 
from sklearn.linear_model import LinearRegression

reg = LinearRegression(fit_intercept=False).fit(make_dum, Y) # Fit the linear model on the full sample
results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
print('R-squared: ', reg.score(make_dum, Y)) # In-sample R squared (no train-test split here)
results


R-squared:  0.5203564146667528


Unnamed: 0,variable,coefficient
0,Chevrolet,453500.0
1,Datsun,289666.7
2,Ford,721173.1
3,Honda,798972.6
4,Hyundai,691891.8
5,Jeep,1499500.0
6,Kia,1614750.0
7,MG Motors,1869457.0
8,Mahindra,1100167.0
9,Maruti Suzuki,588785.0


In [131]:
reg2 = LinearRegression(fit_intercept=False).fit(body_dum, Y)
results2 = pd.DataFrame({'variable':reg2.feature_names_in_, 'coefficient': reg2.coef_})
print('R-squared: ', reg2.score(body_dum, Y)) 
results2

R-squared:  0.4668264087273377


Unnamed: 0,variable,coefficient
0,crossover,705095.2
1,hatchback,533977.3
2,muv,626421.1
3,sedan,809784.1
4,suv,1176495.0


In [134]:
reg3 = LinearRegression(fit_intercept=False).fit(transmisType_dum, Y)
results3 = pd.DataFrame({'variable':reg3.feature_names_in_, 'coefficient': reg3.coef_})
print('R-squared: ', reg3.score(transmisType_dum, Y))
results3

R-squared:  0.0300400184977615


Unnamed: 0,variable,coefficient
0,Automatic,845518.939394
1,Manual,702272.47191


In [None]:
# The body-type dummies are already columns of X, so select them from each split by name
Z_train = X_train[body_dum.columns]
Z_test = X_test[body_dum.columns]

In [149]:
list(X.columns)

['Chevrolet',
 'Datsun',
 'Ford',
 'Honda',
 'Hyundai',
 'Jeep',
 'Kia',
 'MG Motors',
 'Mahindra',
 'Maruti Suzuki',
 'Nissan',
 'Renault',
 'Skoda',
 'Tata',
 'Toyota',
 'Volkswagen',
 'beige',
 'black',
 'blue',
 'bronze',
 'brown',
 'golden',
 'green',
 'grey',
 'maroon',
 'orange',
 'purple',
 'red',
 'silver',
 'white',
 'yellow',
 'crossover',
 'hatchback',
 'muv',
 'sedan',
 'suv',
 'diesel',
 'petrol',
 'petrol+cng',
 '4-Speed',
 '5-Speed',
 '6-Speed',
 '7-Speed',
 'CVT',
 'Automatic',
 'Manual',
 'Make',
 'Make_Year',
 'Color',
 'Body_Type',
 'Mileage',
 'Num_Owners',
 'Seats',
 'Fuel_Type',
 'Transmission',
 'Transmission_Type',
 'Price',
 'Mileage_arcsin',
 'Price_arcsin']

In [155]:
# Double brackets keep the covariate 2-D: one row per car, one column
Z_train = X_train[['Mileage']]
Z_test = X_test[['Mileage']]

# With a single numeric covariate and no dummies, the model needs an intercept
reg4 = linear_model.LinearRegression().fit(Z_train,y_train)
print('R-squared (test): ', reg4.score(Z_test,y_test))
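
scikit-learn estimators expect a feature matrix of shape `(n_samples, n_features)`. Wrapping a single column in a list, as in `[X_train['Mileage']]`, instead yields one row with as many entries as there are samples, which leads to an "inconsistent numbers of samples" error at fit time. A small illustration with a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up sample of four mileage values
mileage = pd.Series([44611, 20305, 29540, 35680], name="Mileage")

as_list = np.asarray([mileage])                  # list of one Series: shape (1, 4), one "sample"
as_frame = mileage.to_frame()                    # DataFrame: shape (4, 1), four samples, one feature
as_reshaped = mileage.to_numpy().reshape(-1, 1)  # same 2-D shape from a NumPy array

print(as_list.shape, as_frame.shape, as_reshaped.shape)
```

Selecting with double brackets, as in `df[['Mileage']]`, gives the same `(n, 1)` layout as `to_frame()` when starting from a DataFrame.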