# Problem Set 6: Linear Regression

### Alex Davis, Jeeyoung Kim, Rafi Bayer, Scarlett Hwang


(Due on Thu, Dec 5th 5pm)

In [38]:
import numpy as np
import pandas as pd  
import statsmodels.formula.api as smf
import sklearn.linear_model as slm
import matplotlib.pyplot as plt

## 1. Data description (15pt)

#### 1. Load the data airbnb-seattle-listings-train.csv. Broadly describe the variables you see, their encoding, and discuss if these may be valuable in determining the price.

In [39]:
data = pd.read_csv('airbnb-seattle-listings-train.csv', sep='\t')

In [40]:
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2318,https://www.airbnb.com/rooms/2318,20190922030624,2019-09-22,Casa Madrona - Urban Oasis 1 block from the park!,"Gorgeous, architect remodeled, 1917 Dutch Colo...","Casa Madrona is a gorgeous, architect remodele...","Gorgeous, architect remodeled, 1917 Dutch Colo...",none,Madrona is a hidden gem of a neighborhood. It ...,...,f,f,strict_14_with_grace_period,f,f,2,2,0,0,0.21
1,5682,https://www.airbnb.com/rooms/5682,20190922030624,2019-09-22,"Cozy Studio, min. to downtown -WiFi",The Cozy Studio is a perfect launchpad for you...,"Hello fellow travelers, Save some money and ha...",The Cozy Studio is a perfect launchpad for you...,none,,...,f,f,strict_14_with_grace_period,f,t,1,1,0,0,3.99
2,9419,https://www.airbnb.com/rooms/9419,20190922030624,2019-09-22,Glorious sun room w/ memory foambed,This beautiful double room features a magical ...,Our new Sunny space has a private room from th...,This beautiful double room features a magical ...,none,"Lots of restaurants (see our guide book) bars,...",...,f,f,moderate,t,t,8,0,8,0,1.29
3,9460,https://www.airbnb.com/rooms/9460,20190922030624,2019-09-22,Downtown Convention Center B&B -- Free Minibar,Take up a glass of wine and unwind on one of t...,Greetings from Seattle. Thanks for considering...,Take up a glass of wine and unwind on one of t...,none,The apartment is situated at the intersection ...,...,t,f,moderate,f,f,4,3,1,0,3.62
4,9531,https://www.airbnb.com/rooms/9531,20190922030624,2019-09-22,The Adorable Sweet Orange Craftsman,The Sweet Orange is a delightful and spacious ...,"The Sweet Orange invites you to stay and play,...",The Sweet Orange is a delightful and spacious ...,none,The neighborhood is awesome! Just far enough ...,...,f,f,strict_14_with_grace_period,f,t,2,2,0,0,0.39


In [41]:
list(data.columns.values)

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


#### 2. Consider how you will handle missing data.

In [42]:
data.shape

(7540, 106)

In [43]:
null_columns = data.columns[data.isnull().any()]
data[null_columns].isnull().sum()

name                              1
summary                         143
space                          1469
description                      49
neighborhood_overview          2138
notes                          3100
transit                        2242
access                         2522
interaction                    1954
house_rules                    1744
thumbnail_url                  7540
medium_url                     7540
xl_picture_url                 7540
host_name                         1
host_since                        1
host_location                    11
host_about                     1992
host_response_time             1484
host_response_rate             1484
host_acceptance_rate           7540
host_is_superhost                 1
host_thumbnail_url                1
host_picture_url                  1
host_neighbourhood              680
host_listings_count               1
host_total_listings_count         1
host_has_profile_pic              1
host_identity_verified      

We drop rows with missing data by passing the `missing='drop'` argument to `smf.ols` when fitting with statsmodels.
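
Because dropping is applied only to the variables that appear in a model, the number of usable observations depends on which variables are included. A minimal sketch of that effect on a made-up frame (the values and the helper name `complete_cases` are illustrative assumptions, not taken from the listings data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN marks a missing entry.
toy = pd.DataFrame({
    "price": [100.0, 150.0, 80.0, 200.0, 120.0],
    "cleaning_fee": [20.0, np.nan, 15.0, 50.0, 30.0],
    "review_scores_rating": [95.0, 90.0, np.nan, np.nan, 99.0],
})

def complete_cases(df, columns):
    """Count rows with no missing value in any of the given columns."""
    return int(df[columns].notna().all(axis=1).sum())

# Fewer variables means fewer chances for a row to be dropped.
n_small = complete_cases(toy, ["price", "cleaning_fee"])
n_large = complete_cases(toy, ["price", "cleaning_fee", "review_scores_rating"])
print(n_small, n_large)
```

Running the same count on the real columns before fitting would show how many rows each candidate model can actually use.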

#### 3. Consider which variables you are going to use below.


bedrooms

bathrooms

cleaning_fee

review_scores_rating

security_deposit

## 2. Model (60pt)

#### 1. Either split your data into training and validation sets, or just use cross validation below.

In [44]:
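# hold out everything except the last 1000 rows for validation;
# the last 1000 rows are used for training
val = data[:-1000]
data = data[-1000:]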
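
The slices above take rows in file order. If the file happens to be sorted (for example by listing id), a shuffled split may be more representative. A minimal sketch, assuming a hypothetical helper `shuffled_split` and a fixed seed for reproducibility:

```python
import numpy as np
import pandas as pd

def shuffled_split(df, n_val, seed=0):
    """Return (train, val) after shuffling the rows with a fixed seed."""
    shuffled = df.sample(frac=1, random_state=seed)
    val_part = shuffled.iloc[:n_val].copy()
    train_part = shuffled.iloc[n_val:].copy()
    return train_part, val_part

# Toy frame standing in for the listings data.
toy = pd.DataFrame({"id": np.arange(10), "price": np.linspace(50, 500, 10)})
train_part, val_part = shuffled_split(toy, n_val=3)
print(len(train_part), len(val_part))
```

The `.copy()` calls keep later column assignments from writing into a view of the original frame.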

#### 2. Develop the models. Report all the variables and how you clean/encode them. While the exact details are visible in the code, explain the broad choices in text.

In [45]:
# this function takes in a string representing a dollar amount, e.g. "$1,250.00"
# it returns the amount as a float; non-string values (NaN) are returned unchanged
def clean_dollars(s):
    if isinstance(s, str):
        return float(s[s.index("$")+1:].replace(",", ""))
    return s

# extracting columns (.copy() so the assignments below do not write to a slice)
columns = ["price", "security_deposit", "bedrooms", "bathrooms", "cleaning_fee", "review_scores_rating"]
data = data[columns].copy()
val = val[columns].copy()

# cleaning the dollar-valued columns in both the training and validation data
for col in ["price", "security_deposit", "cleaning_fee"]:
    data[col] = data[col].map(clean_dollars)
    val[col] = val[col].map(clean_dollars)
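
The same cleaning can also be done column-wise with pandas string methods. A minimal sketch, assuming a hypothetical helper `dollars_to_float` and a column that holds strings such as `"$1,250.00"` with NaN for missing values:

```python
import numpy as np
import pandas as pd

def dollars_to_float(series):
    """Convert strings such as '$1,250.00' to floats, keeping NaN as NaN."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything that still is not a number into NaN
    return pd.to_numeric(stripped, errors="coerce")

raw = pd.Series(["$85.00", "$1,250.00", np.nan])
cleaned = dollars_to_float(raw)
print(cleaned.tolist())
```

Note that `errors="coerce"` silently maps malformed entries to NaN, so it is worth checking the number of missing values before and after the conversion.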


#### 3. Report the final number of observations, the estimated coefficient values, adjusted R2, and RMSE on the validation set (or k-fold CV) for three models:
#### (a) a simple one that only contains a few most important variables/best predictors. What do you think are 2-3 best predictors in the data?
#### (b) the full model: everything you consider useful.
#### (c) something in between.

### Few variables

In [56]:
m1 = smf.ols(formula = 'price ~ security_deposit + cleaning_fee',
             data=data, missing='drop').fit()
m1.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.035 |
| Model: | OLS | Adj. R-squared: | 0.032 |
| Method: | Least Squares | F-statistic: | 14.0 |
| Date: | Sun, 01 Dec 2019 | Prob (F-statistic): | 1.06e-06 |
| Time: | 14:07:17 | Log-Likelihood: | -5031.8 |
| No. Observations: | 780 | AIC: | 10070.0 |
| Df Residuals: | 777 | BIC: | 10080.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 125.5954 | 8.502 | 14.773 | 0.000 | 108.907 | 142.284 |
| security_deposit | 0.0129 | 0.014 | 0.939 | 0.348 | -0.014 | 0.040 |
| cleaning_fee | 0.2910 | 0.063 | 4.651 | 0.000 | 0.168 | 0.414 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 833.232 | Durbin-Watson: | 1.366 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 50057.108 |
| Skew: | 5.059 | Prob(JB): | 0.0 |
| Kurtosis: | 40.919 | Cond. No. | 782.0 |


### Something In Between

In [63]:
m2 = smf.ols(formula = 'price ~ security_deposit + cleaning_fee + review_scores_rating',
             data=data, missing='drop').fit()
m2.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.09 |
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 13.38 |
| Date: | Sun, 01 Dec 2019 | Prob (F-statistic): | 2.41e-08 |
| Time: | 14:22:01 | Log-Likelihood: | -2618.7 |
| No. Observations: | 412 | AIC: | 5245.0 |
| Df Residuals: | 408 | BIC: | 5261.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 218.9973 | 69.535 | 3.149 | 0.002 | 82.305 | 355.690 |
| security_deposit | 0.0208 | 0.019 | 1.101 | 0.271 | -0.016 | 0.058 |
| cleaning_fee | 0.7046 | 0.123 | 5.725 | 0.000 | 0.463 | 0.946 |
| review_scores_rating | -1.1897 | 0.739 | -1.610 | 0.108 | -2.642 | 0.263 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 377.287 | Durbin-Watson: | 1.194 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 8303.657 |
| Skew: | 4.022 | Prob(JB): | 0.0 |
| Kurtosis: | 23.47 | Cond. No. | 4290.0 |


### All variables

In [78]:
m3 = smf.ols(formula = 'price ~ security_deposit + bedrooms + bathrooms + cleaning_fee + review_scores_rating',
             data=data, missing='drop').fit()
m3.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.128 |
| Model: | OLS | Adj. R-squared: | 0.118 |
| Method: | Least Squares | F-statistic: | 11.96 |
| Date: | Sun, 01 Dec 2019 | Prob (F-statistic): | 8.14e-11 |
| Time: | 14:49:32 | Log-Likelihood: | -2609.7 |
| No. Observations: | 412 | AIC: | 5231.0 |
| Df Residuals: | 406 | BIC: | 5256.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 183.2277 | 69.789 | 2.625 | 0.009 | 46.035 | 320.420 |
| security_deposit | 0.0186 | 0.019 | 0.995 | 0.320 | -0.018 | 0.055 |
| bedrooms | 20.5506 | 9.031 | 2.276 | 0.023 | 2.797 | 38.304 |
| bathrooms | 32.8718 | 16.919 | 1.943 | 0.053 | -0.388 | 66.131 |
| cleaning_fee | 0.4083 | 0.140 | 2.925 | 0.004 | 0.134 | 0.683 |
| review_scores_rating | -1.3060 | 0.727 | -1.797 | 0.073 | -2.735 | 0.123 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 398.025 | Durbin-Watson: | 1.144 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10416.146 |
| Skew: | 4.295 | Prob(JB): | 0.0 |
| Kurtosis: | 26.086 | Cond. No. | 4390.0 |


#### 4. Interpret the coefficients of the reported models. Again, only interpret the most interesting/important ones, not all of those! Do the coefficient values differ between the models? Can you explain why?

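In the full model, bedrooms and bathrooms have the largest coefficients: each additional bedroom is associated with a price about 20.55 dollars higher and each additional bathroom with a price about 32.87 dollars higher, holding the other variables fixed. This would be expected, as square footage is normally a strong determinant of price and both variables act as proxies for it. The coefficients on cleaning fee, review score, and security deposit are much smaller per unit, and those on security deposit and review score are not statistically significant at the 5% level in any model. Cleaning fee is significant in all three models, but its coefficient changes from 0.29 to 0.70 to 0.41; it drops once bedrooms and bathrooms are added, which suggests it was partly picking up the size of the listing. The models are also fit on different samples (780 observations for the first model, 412 for the other two), since rows with a missing review_scores_rating are dropped once that variable enters the formula.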
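
One common reason a coefficient changes between models is that a newly added variable is correlated with one already in the model. A minimal sketch on synthetic data (the data-generating numbers and the helper name `ols` are assumptions chosen for illustration, not estimates from the listings):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical process: larger listings charge higher cleaning fees,
# and both size and cleaning fee feed into the price.
bedrooms = rng.integers(1, 5, size=n).astype(float)
cleaning_fee = 20.0 * bedrooms + rng.normal(0.0, 10.0, size=n)
price = 50.0 + 30.0 * bedrooms + 0.4 * cleaning_fee + rng.normal(0.0, 20.0, size=n)

def ols(columns, y):
    """Least-squares coefficients with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short = ols([cleaning_fee], price)            # bedrooms omitted
full = ols([cleaning_fee, bedrooms], price)   # bedrooms included
print(short[1], full[1])
```

In the short model the cleaning-fee coefficient absorbs part of the bedroom effect; once bedrooms is included it falls back toward the value used to generate the data.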

#### 5. Use your models to predict the price. Report RMSE in the table above.

In [72]:
data.dropna(inplace=True)

In [73]:
x = data[['security_deposit', 'bedrooms', 'bathrooms', 'cleaning_fee', 'review_scores_rating']].values
x

array([[250. ,   1. ,   1. ,  60. ,  80. ],
       [200. ,   2. ,   2. ,  95. , 100. ],
       [250. ,   1. ,   1. ,  60. ,  90. ],
       ...,
       [  0. ,   1. ,   1.5,  25. , 100. ],
       [150. ,   1. ,   1.5,  60. , 100. ],
       [500. ,   1. ,   1. ,  85. , 100. ]])

In [74]:
X = np.hstack((np.ones((len(x),1)),x))
X.size

2472

In [75]:
y = data.price.values
y

array([999., 189., 999., 999., 999., 999., 999., 139.,  92., 128.,  62.,
       181., 150.,  65., 135., 160., 175., 180.,  90.,  80., 190.,  99.,
        38.,  40., 200.,  41.,  89., 450., 182., 198., 140., 436., 430.,
        37.,  57.,  31.,  31.,  44.,  32.,  32., 150.,  63.,  55.,  75.,
        75.,  75.,  75., 249., 125., 325., 113., 250.,  75., 125., 180.,
        70., 180., 114., 100.,  75.,  76., 179., 115.,  75., 300.,  50.,
       324., 120.,  79.,  69., 149., 196., 200., 100., 190., 180.,  60.,
        50., 150., 220.,  93., 250.,  50.,  50.,  69., 350.,  80., 225.,
        51., 149.,  74.,  60., 175., 150., 166., 161., 300., 149., 165.,
        75., 175.,  55., 250., 250., 250., 250., 250., 250., 250., 129.,
        87.,  93., 250., 250., 250., 250., 250.,  99., 375., 300., 120.,
       300., 100., 120., 125., 298., 425., 120., 120.,  87., 157., 180.,
       300., 125., 325.,  59., 160., 330., 125.,  49., 200., 110.,  88.,
        88.,  68., 175.,  89.,  80.,  75., 120.,  5

In [76]:
beta = np.linalg.inv(X.T @ X) @ X.T @ y
beta

array([ 1.83227655e+02,  1.85531692e-02,  2.05505931e+01,  3.28717732e+01,
        4.08337699e-01, -1.30599426e+00])
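
The explicit inverse reproduces the statsmodels estimates here, but it can lose accuracy when the columns of `X` are nearly collinear. A minimal sketch comparing it with `np.linalg.solve` and `np.linalg.lstsq` on synthetic data (the data and the names `beta_inv`, `beta_solve`, `beta_lstsq` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
X = np.hstack((np.ones((len(x), 1)), x))   # intercept column, built as above
y = X @ np.array([2.0, 0.5, -1.0, 3.0]) + rng.normal(scale=0.1, size=200)

beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y          # explicit inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations, no inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based least squares

print(np.allclose(beta_inv, beta_lstsq))
```

On a well-conditioned design all three agree to floating-point precision; `lstsq` is generally the safer default.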

In [79]:
m3.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.128 |
| Model: | OLS | Adj. R-squared: | 0.118 |
| Method: | Least Squares | F-statistic: | 11.96 |
| Date: | Sun, 01 Dec 2019 | Prob (F-statistic): | 8.14e-11 |
| Time: | 14:49:38 | Log-Likelihood: | -2609.7 |
| No. Observations: | 412 | AIC: | 5231.0 |
| Df Residuals: | 406 | BIC: | 5256.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 183.2277 | 69.789 | 2.625 | 0.009 | 46.035 | 320.420 |
| security_deposit | 0.0186 | 0.019 | 0.995 | 0.320 | -0.018 | 0.055 |
| bedrooms | 20.5506 | 9.031 | 2.276 | 0.023 | 2.797 | 38.304 |
| bathrooms | 32.8718 | 16.919 | 1.943 | 0.053 | -0.388 | 66.131 |
| cleaning_fee | 0.4083 | 0.140 | 2.925 | 0.004 | 0.134 | 0.683 |
| review_scores_rating | -1.3060 | 0.727 | -1.797 | 0.073 | -2.735 | 0.123 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 398.025 | Durbin-Watson: | 1.144 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10416.146 |
| Skew: | 4.295 | Prob(JB): | 0.0 |
| Kurtosis: | 26.086 | Cond. No. | 4390.0 |
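
The question asks for RMSE, which needs predictions on held-out rows rather than the coefficients alone. A minimal sketch of that computation, assuming hypothetical helpers `rmse` and `predict` and made-up numbers in place of the cleaned validation rows:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def predict(x, beta):
    """Apply coefficients (intercept first) to features without an intercept column."""
    x = np.asarray(x, dtype=float)
    X = np.hstack((np.ones((len(x), 1)), x))
    return X @ beta

# Hypothetical coefficients and held-out rows, for illustration only.
beta_demo = np.array([10.0, 2.0])
x_val_demo = np.array([[1.0], [2.0], [3.0]])
y_val_demo = np.array([12.0, 16.0, 14.0])

rmse_demo = rmse(y_val_demo, predict(x_val_demo, beta_demo))
print(rmse_demo)
```

With the notebook's objects, the same two functions could be applied to the validation features and prices after dropping rows with missing values, using the `beta` estimated on the training rows.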


## 3. Think (15pt)

#### 1. Does your model do a good job in predicting the price?

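Only partly. The coefficients we computed by hand from the normal equations match the statsmodels estimates for the full model exactly, which confirms that the fit is computed correctly. That does not by itself show that the predictions are good: the adjusted R2 of the full model is only 0.118, so the model explains a small share of the variation in price.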


#### 2. Can your results be used for something interesting, say for research or commercial purposes? What might it be?

Commercial purposes: This could be useful for hotel chains that are considering building a new hotel in an area. They can see what a competitive rate would be and build accordingly. Travel agents and airlines could use this data to create travel packages.

Research purposes: Universities could look at housing trends and see what kind of effect Airbnb might have on local economies and housing markets. City governments could take action to limit negative effects of the Airbnb market.

#### 3. You were predicting the price. Did you include any other price-related variables, such as weekly price or security deposit in your model? What would that mean in terms of the model usability?

Using price-related variables can sometimes have reverse effects. Security deposits and weekly prices might differ drastically from the daily price. This is often done to entice longer-term rentals, for example when the weekly price works out to a cheaper daily rate. This could have unintended effects on the expected daily price.

#### 4. Imagine you are developing this work for a local, or for the national government. Why may government be interested in such a job? Do you see any ethical issues that may arise from your work?

This could potentially discriminate against Airbnb if local governments use this data to argue that homeownership and conventional rentals are being driven out of the market in favor of a higher-profit Airbnb model. Politics aside, this could be damaging to that business model. Beyond that, local governments are interested in property values and census data affecting their districts. If fewer permanent residents end up living in a district because housing shifts toward the Airbnb model, politicians might see this as disadvantageous to their re-election chances and might want to limit Airbnb and/or favor lower-income housing.

## 4. Additional task (10pt)

#### 1. Load the testing data airbnb-seattle-listings-test.csv. This has exactly the same structure and variables as the original dataset.

#### 2. Compute RMSE on the testing dataset. This is the ultimate goodness measure of your model. Present it prominently in your report.

#### 3. Do not tinker with the model any more. This was your final test.