In [1]:
import pandas as pd
import pathlib
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder

In [2]:
path = pathlib.Path().cwd().parents[1] / 'CSVs' / 'Kings County Housing.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,sqft_basement,yr_built,yr_renovated,zipcode
0,1,221900,3,1.0,1180,5650,1.0,0,0,0,1955,0,98178
1,2,538000,3,2.25,2570,7242,2.0,0,0,400,1951,1991,98125
2,3,180000,2,1.0,770,10000,1.0,0,0,0,1933,0,98028
3,4,604000,4,3.0,1960,5000,1.0,0,0,910,1965,0,98136
4,5,510000,3,2.0,1680,8080,1.0,0,0,0,1987,0,98074


- a) Where is King County? Use the zip codes if you are unsure.
    - King County is in Washington State; the 980xx and 981xx zip codes in the data cover Seattle and its surrounding area.

- b) How many observations are in the dataset? What does 1 row correspond to?
    - There are 21613 observations in the dataset
    - Each row corresponds to a house in the county

In [3]:
df.shape

(21613, 13)

- c) What are the median statistics for price, bedrooms, bathrooms, square foot of living space, and year built?
    - Price: 450000
    - Bedrooms: 3
    - Bathrooms: 2.25
    - Square Foot of Living Space: 1910
    - Year Built: 1975


In [4]:
df[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'yr_built']].describe()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,yr_built
count,21613.0,21613.0,21613.0,21613.0,21613.0
mean,540088.1,3.370842,2.114757,2079.899736,1971.005136
std,367127.2,0.930062,0.770163,918.440897,29.373411
min,75000.0,0.0,0.0,290.0,1900.0
25%,321950.0,3.0,1.75,1427.0,1951.0
50%,450000.0,3.0,2.25,1910.0,1975.0
75%,645000.0,4.0,2.5,2550.0,1997.0
max,7700000.0,33.0,8.0,13540.0,2015.0
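
The `50%` row of `describe()` is the median of each column. As a minimal sketch of what that row computes, the snippet below takes the median of the five preview rows printed by `df.head()`, using only the standard library. The `preview` dictionary is copied from that output and is not the full dataset, so its medians differ from the answers above; on the full frame, calling `.median()` on the same column selection should reproduce the `50%` row.

```python
from statistics import median

# Five preview rows copied from the df.head() output (not the full dataset).
preview = {
    "price": [221900, 538000, 180000, 604000, 510000],
    "bedrooms": [3, 3, 2, 4, 3],
    "bathrooms": [1.0, 2.25, 1.0, 3.0, 2.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "yr_built": [1955, 1951, 1933, 1965, 1987],
}

# The median is the middle value of each column after sorting.
preview_medians = {name: median(values) for name, values in preview.items()}

for name, value in preview_medians.items():
    print(f"{name}: {value}")
```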


- d) Run the regression: $$\text{Price} = a + b * \text{Bedrooms}$$
    - Write a full sentence explaining the coefficient on bedrooms.
        - The coefficient on bedrooms is about $121716$, which means that each additional bedroom is associated with a house price that is about $121716$ dollars higher on average.
    - Is the coefficient statistically significant? What is the 95% confidence interval on the coefficient on bedrooms? Interpret the interval.
        - The coefficient is statistically significant because the p-value is less than $0.05$.
        - The 95% confidence interval on the coefficient on bedrooms is $[1.17 * 10^5, 1.27 * 10^5]$ which means that we are $95\%$ confident that the true coefficient on bedrooms is between $117000$ and $127000$.
    - If a house has 2 bedrooms, what does the one variable model predict the price will be?
        - Approximately $\$373234.61$
    - Is the relationship between bedrooms and price necessarily causal?
        - The relationship between bedrooms and price is not necessarily causal. There could be other factors that affect the price of a house that are not accounted for in this model.
    - Interpret the $R^2$ value of this model.
        - The $R^2$ value of this model is $0.095$ which means that $9.5\%$ of the variation in price can be explained by the number of bedrooms.
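
The answers for this model can be checked by hand from three numbers in the regression summary. The sketch below uses the rounded intercept, slope, and standard error as printed, and it assumes the normal critical value 1.96 for the 95% interval, which is a close approximation with 21,611 residual degrees of freedom. Variable names are illustrative only.

```python
# Values copied from the one-variable regression summary; they are rounded,
# so results agree with the notebook only to within rounding.
const = 129802.36          # intercept a
coef_bedrooms = 121716.13  # slope b
se_bedrooms = 2554.304     # standard error of b

# t statistic: the estimate divided by its standard error.
t_stat = coef_bedrooms / se_bedrooms

# Approximate 95% interval using the normal critical value (assumption).
z = 1.96
ci_low = coef_bedrooms - z * se_bedrooms
ci_high = coef_bedrooms + z * se_bedrooms

# Predicted price for a 2-bedroom house: a + b * 2.
pred_2br = const + coef_bedrooms * 2

print(f"t = {t_stat:.3f}")
print(f"95% CI ~ [{ci_low:,.0f}, {ci_high:,.0f}]")
print(f"2-bedroom prediction ~ {pred_2br:,.2f}")
```

The interval endpoints round to the $1.17 * 10^5$ and $1.27 * 10^5$ shown in the summary table.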

In [5]:
X = df['bedrooms']
Y = df['price']

X_full = sm.add_constant(X)

model = sm.OLS(Y, X_full)
results = model.fit()

display(results.summary())
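# Use .iloc for positional access; integer keys on a labeled Series are deprecated in recent pandas.
f'The equation of the line is: y = {results.params.iloc[0]:.2f} + {results.params.iloc[1]:.2f}x'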

0,1,2,3
Dep. Variable:,price,R-squared:,0.095
Model:,OLS,Adj. R-squared:,0.095
Method:,Least Squares,F-statistic:,2271.0
Date:,"Tue, 21 Nov 2023",Prob (F-statistic):,0.0
Time:,22:46:18,Log-Likelihood:,-306520.0
No. Observations:,21613,AIC:,613100.0
Df Residuals:,21611,BIC:,613100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.298e+05,8931.866,14.533,0.000,1.12e+05,1.47e+05
bedrooms,1.217e+05,2554.304,47.651,0.000,1.17e+05,1.27e+05

0,1,2,3
Omnibus:,18859.406,Durbin-Watson:,1.961
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1199044.96
Skew:,3.904,Prob(JB):,0.0
Kurtosis:,38.644,Cond. No.,14.2


'The equation of the line is: y = 129802.36 + 121716.13x'

In [6]:
# Predicted price for a 2-bedroom house: intercept + 2 * slope.
results.params.iloc[0] + results.params.iloc[1] * 2

373234.6093419398

- e) Run the regression of price on bedrooms and living square footage: $$\text{Price} = a + b * \text{Bedrooms} + c * \text{Sqft\_living}$$
    - Write a full sentence explaining the coefficient on bedrooms. How has it changed? Why might it have changed?
        - The coefficient on bedrooms is now negative (it went from $1.217 * 10^5$ to $-5.707 * 10^4$): holding living square footage constant, each additional bedroom is associated with a price about $57067$ dollars lower. It likely changed because bedrooms and square footage are positively correlated, so in the one-variable model bedrooms was acting as a proxy for house size. Once size is controlled for, an extra bedroom in the same square footage means smaller rooms, which buyers may value less.
    - How has the $R^2$ changed from the first model?
        - The $R^2$ value has increased from $0.095$ to $0.507$ which means that 50.7% of the variability in housing prices can be explained by the number of bedrooms and the square footage of living space.
    - What does the model predict for the price of a 2 bedroom, 1000 square foot apartment?
        - 279,284.53
    - What does the model predict for the price of a 3 bedroom, 1000 square foot apartment?
        - 222,217.77
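
The sign change on bedrooms can be made precise with the omitted-variable identity for OLS. Write the one-variable slope as $\tilde{b}$ and the two-variable coefficients as $b$ and $c$, and let $\delta$ be the slope from regressing Sqft_living on Bedrooms in the same sample ($\tilde{b}$ and $\delta$ are symbols introduced here, not part of the assignment):

$$\tilde{b} = b + c * \delta$$

Plugging in the reported estimates gives the value of $\delta$ implied by the two regressions, rather than one estimated directly from the data:

$$121716.13 = -57066.76 + 313.9487 * \delta \quad \Rightarrow \quad \delta \approx 569.5$$

Read this way, each additional bedroom goes with roughly 570 more square feet of living space in this sample, and the one-variable slope was mostly picking up the value of that extra space.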


In [7]:
X = df[['bedrooms', 'sqft_living']]
Y = df['price']

X_full = sm.add_constant(X)

model = sm.OLS(Y, X_full)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.507
Model:,OLS,Adj. R-squared:,0.507
Method:,Least Squares,F-statistic:,11100.0
Date:,"Tue, 21 Nov 2023",Prob (F-statistic):,0.0
Time:,22:46:18,Log-Likelihood:,-299970.0
No. Observations:,21613,AIC:,599900.0
Df Residuals:,21610,BIC:,600000.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,7.947e+04,6604.764,12.032,0.000,6.65e+04,9.24e+04
bedrooms,-5.707e+04,2308.223,-24.723,0.000,-6.16e+04,-5.25e+04
sqft_living,313.9487,2.337,134.314,0.000,309.367,318.530

0,1,2,3
Omnibus:,14423.033,Durbin-Watson:,1.986
Prob(Omnibus):,0.0,Jarque-Bera (JB):,492253.321
Skew:,2.732,Prob(JB):,0.0
Kurtosis:,25.732,Cond. No.,8870.0


In [8]:
results.params

const          79469.359075
bedrooms      -57066.758923
sqft_living      313.948686
dtype: float64

In [9]:
print(f'What does the model predict for the price of a 2 bedroom, 1000 square foot apartment?\n{results.params.iloc[0] + results.params.iloc[1] * 2 + results.params.iloc[2] * 1000:,.2f}')
print(f'What does the model predict for the price of a 3 bedroom, 1000 square foot apartment?\n{results.params.iloc[0] + results.params.iloc[1] * 3 + results.params.iloc[2] * 1000:,.2f}')

What does the model predict for the price of a 2 bedroom, 1000 square foot apartment?
279,284.53
What does the model predict for the price of a 3 bedroom, 1000 square foot apartment?
222,217.77
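
The two predictions differ by exactly one bedroom coefficient, because square footage is held fixed at 1000. The sketch below reproduces both numbers from the coefficients printed by `results.params`; `predict_price` is a hypothetical helper written for this illustration, not part of statsmodels.

```python
# Coefficients copied from the results.params output of the two-variable model.
const = 79469.359075
b_bedrooms = -57066.758923
c_sqft = 313.948686


def predict_price(bedrooms, sqft_living):
    """Fitted price from the two-variable model (hypothetical helper)."""
    return const + b_bedrooms * bedrooms + c_sqft * sqft_living


pred_2br = predict_price(2, 1000)
pred_3br = predict_price(3, 1000)

# With sqft_living fixed, the gap between the two predictions is
# one bedroom coefficient.
gap = pred_3br - pred_2br

print(f"{pred_2br:,.2f}")
print(f"{pred_3br:,.2f}")
print(f"{gap:,.2f}")
```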


- f) Add dummies for zip code to your second model and run the regression: $$\text{Price} = a + b * \text{Bedrooms} + c * \text{Sqft\_living} + d * \text{Zip}$$ You should have 70 zip dummies. You do not need to interpret them, just include them.
    - What is the $R^2$ of this model? Write a full sentence.
        - The $R^2$ value of this model is $0.738$ which means that $73.8\%$ of the variation in price can be explained by the number of bedrooms, square footage of living space, and zip code.
    - What is the coefficient on bedrooms? How does it compare to the other models? Is it statistically significant?
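        - The coefficient on bedrooms is $-4.471 * 10^4$. It is still negative, like the second model's $-5.707 * 10^4$ but smaller in magnitude, and it has the opposite sign from the first model's $1.217 * 10^5$. It is statistically significant because the p-value is less than $0.05$.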
    - Suppose we wanted to use this model to make a causal statement about the effect of bedrooms. Write a full sentence about the assumption we would have to make.
        - We would have to assume that, once living square footage and zip code are held constant, the number of bedrooms is unrelated to every other factor that affects price (for example lot size, waterfront, or view), so that no omitted variable is driving both bedrooms and price.
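
Zip code enters the model as a set of 0/1 indicator columns with one level left out as the baseline, which avoids perfect collinearity with the constant. The summary's `Df Model: 71` is consistent with 2 numeric regressors plus 69 dummies, that is, 70 zip codes minus one baseline. As a rough sketch of what `drop_first=True` does, the snippet below builds the same kind of encoding by hand for the five zip codes in the preview rows; it uses plain Python rather than pandas and is only meant to show the idea.

```python
# Zip codes from the five preview rows (not the full set in the data).
zipcodes = [98178, 98125, 98028, 98136, 98074]

# Sort the distinct levels and drop the first one as the baseline,
# mirroring drop_first=True in pd.get_dummies.
levels = sorted(set(zipcodes))
baseline, kept_levels = levels[0], levels[1:]

# One 0/1 column per kept level; a baseline row is all zeros.
dummy_rows = [[int(z == level) for level in kept_levels] for z in zipcodes]

print("baseline:", baseline)
print("columns:", kept_levels)
for z, row in zip(zipcodes, dummy_rows):
    print(z, row)
```

Each dummy coefficient is then read relative to the baseline zip code, holding bedrooms and square footage fixed.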


In [10]:
dummies = pd.get_dummies(df['zipcode'], drop_first=True, dtype=int)
X = df[['bedrooms', 'sqft_living']].join(dummies)
Y = df['price']

X_full = sm.add_constant(X)

model = sm.OLS(Y, X_full)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.738
Model:,OLS,Adj. R-squared:,0.737
Method:,Least Squares,F-statistic:,856.1
Date:,"Tue, 21 Nov 2023",Prob (F-statistic):,0.0
Time:,22:46:18,Log-Likelihood:,-293120.0
No. Observations:,21613,AIC:,586400.0
Df Residuals:,21541,BIC:,587000.0
Df Model:,71,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-9.778e+04,1.1e+04,-8.878,0.000,-1.19e+05,-7.62e+04
bedrooms,-4.471e+04,1715.670,-26.058,0.000,-4.81e+04,-4.13e+04
sqft_living,278.7524,1.855,150.311,0.000,275.117,282.387
98002,2.705e+04,1.66e+04,1.629,0.103,-5503.018,5.96e+04
98003,4294.6204,1.5e+04,0.287,0.774,-2.5e+04,3.36e+04
98004,8.151e+05,1.46e+04,56.004,0.000,7.87e+05,8.44e+05
98005,3.395e+05,1.76e+04,19.292,0.000,3.05e+05,3.74e+05
98006,3.241e+05,1.31e+04,24.770,0.000,2.98e+05,3.5e+05
98007,2.772e+05,1.87e+04,14.837,0.000,2.41e+05,3.14e+05

0,1,2,3
Omnibus:,20425.505,Durbin-Watson:,1.982
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2479617.959
Skew:,4.209,Prob(JB):,0.0
Kurtosis:,54.794,Cond. No.,149000.0


- g) Run one more model to evaluate the effect of bedrooms on price, picking some other variable(s) for controls. What variables did you include? Write the full estimating equation, and include a screenshot of your results. What coefficient for bedrooms do you find?
    - I included square footage of living space, bathrooms, and number of floors as controls, while still keeping bedrooms. The full estimating equation is: 
        - $\text{Price} = 7.467 * 10^4 - 5.785 * 10^4 * \text{Bedrooms} + 309.3932 * \text{Sqft\_living} + 7853.5216 * \text{Bathrooms} + 200.4972 * \text{Floors}$
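    - The coefficient for bedrooms is $-5.785 * 10^4$, which is nearly identical to the second model's $-5.707 * 10^4$ and somewhat larger in magnitude than the third model's $-4.471 * 10^4$. Adding bathrooms and floors leaves $R^2$ at $0.507$, the same as the second model.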
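
Which of the added controls carry any weight can be read from the t statistics in the results summary for this model. The sketch below recomputes them from the printed coefficients and standard errors, and it assumes the large-sample 5% cutoff of 1.96. The inputs are rounded as displayed, so the t values match the table only to within rounding.

```python
# Coefficients and standard errors copied from the results summary
# for the four-variable model (rounded as printed).
estimates = {
    "bedrooms": (-5.785e4, 2347.323),
    "sqft_living": (309.3932, 3.087),
    "bathrooms": (7853.5216, 3814.223),
    "floors": (200.4972, 3775.505),
}

# |t| > 1.96 approximates significance at the 5% level for a sample this large.
CRITICAL = 1.96
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant = {name: abs(t) > CRITICAL for name, t in t_stats.items()}

for name in estimates:
    print(f"{name:12s} t = {t_stats[name]:8.3f}  significant: {significant[name]}")
```

Under that cutoff, bathrooms is only marginally significant and floors is not, which fits the unchanged $R^2$.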

In [11]:
X = df[['bedrooms', 'sqft_living', 'bathrooms', 'floors']]


Y = df['price']

X_full = sm.add_constant(X)

model = sm.OLS(Y, X_full)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.507
Model:,OLS,Adj. R-squared:,0.507
Method:,Least Squares,F-statistic:,5554.0
Date:,"Tue, 21 Nov 2023",Prob (F-statistic):,0.0
Time:,22:46:18,Log-Likelihood:,-299960.0
No. Observations:,21613,AIC:,599900.0
Df Residuals:,21608,BIC:,600000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,7.467e+04,7679.122,9.724,0.000,5.96e+04,8.97e+04
bedrooms,-5.785e+04,2347.323,-24.644,0.000,-6.24e+04,-5.32e+04
sqft_living,309.3932,3.087,100.228,0.000,303.343,315.444
bathrooms,7853.5216,3814.223,2.059,0.040,377.363,1.53e+04
floors,200.4972,3775.505,0.053,0.958,-7199.772,7600.766

0,1,2,3
Omnibus:,14450.413,Durbin-Watson:,1.985
Prob(Omnibus):,0.0,Jarque-Bera (JB):,494760.943
Skew:,2.739,Prob(JB):,0.0
Kurtosis:,25.79,Cond. No.,10400.0
