In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

#import statsmodels formula
import statsmodels.formula.api as smf


# Økonometri A  

## Problem Set 6  

### Hedonic price regressions  

In problem set 6, we estimate a regression model relating house prices to house characteristics. This model is an example of the so-called hedonic price regression which is widely used in economics.

A hedonic regression for house prices usually includes house characteristics and community attributes as explanatory variables. In this case, the model's coefficients may be interpreted as the implicit price of each characteristic. Hedonic price models can be useful for estimating the price of characteristics for which there are no markets. For example, we do not observe a price for clean air, but we may be able to estimate the (implicit) price effect of clean air on house prices.

The data used in problem set 6 contains a random sample of apartment sales in Copenhagen in 2005. We will focus on apartments sold in the four neighborhoods of Copenhagen K, N, V, and Ø. For each apartment, we observe the sales price and a range of apartment characteristics which are all specific to the year of 2005. We consider these data as a cross-section, exploiting variation in prices and characteristics across apartments to estimate the parameters of the hedonic price regression.

The Stata file `PS6.dta` includes the following variables for a total of 988 apartment sales in 2005:

- Sales price in 2005-DKK (**price**)
- Apartment size in square meters (**m2**)
- Number of rooms (**rooms**)
- Number of toilets (**toilets**)
- Floor location of apartment (**floor**)
- Apartment location in Copenhagen (**location**)
- Number of apartment units in the building (**building_units**)
- Building age (**age**)


---

### Group work  

Discuss the following questions in groups:

#### Question 1.

Consider the simple hedonic model:

$$
\log(price_i) = \beta_0 + \beta_1 m2_i + \beta_3 rooms_i + \beta_4 toilets_i + u_i \tag{1}
$$

Discuss whether it is reasonable to assume that assumptions MLR.1–MLR.5 are satisfied for model (1).



**Your answer:**


#### Question 2.
The hedonic model can be extended with dummy variables to investigate whether there are level differences between apartments in different neighbourhoods of Copenhagen. Note that the variable **location** takes on four categories: _KBH K_, _KBH N_, _KBH O_ and _KBH V_.

 The extended model with dummy variables could look like this:
 
$$
\begin{aligned}
\log(price_i) =\, &\beta_0 + \delta_1 KbhN_i + \delta_2 KbhO_i + \delta_3 KbhV_i \\
&+ \beta_1 m2_i + \beta_2 rooms_i + \beta_3 toilets_i + \epsilon_i
\end{aligned} \tag{2}
$$

1. Explain what a dummy variable is.

2. Why don't we include a dummy for _Kbh K_ in model (2)?

3. In terms of the model parameters, what is the intercept for apartments located in Kbh V?

**Your answer:**

#### Question 3.
The hedonic model can further be extended with interaction terms to see if the model parameters differ across locations in Copenhagen.

$$
\begin{aligned}
\log(price_i) =\, &\beta_0 + \delta_1 KbhN_i + \delta_2 KbhO_i + \delta_3 KbhV_i \\
&+ \beta_1 m2_i + \delta_4 KbhN_i \cdot m2_i + \delta_5 KbhO_i \cdot m2_i + \delta_6 KbhV_i \cdot m2_i \\
&+ \beta_2 rooms_i + \delta_7 KbhN_i \cdot rooms_i + \delta_8 KbhO_i \cdot rooms_i + \delta_9 KbhV_i \cdot rooms_i \\
&+ \beta_3 toilets_i + \delta_{10} KbhN_i \cdot toilets_i + \delta_{11} KbhO_i \cdot toilets_i + \delta_{12} KbhV_i \cdot toilets_i \\
&+ \epsilon_i
\end{aligned} \tag{3}
$$

1. Which coefficients in model (3) describe the interaction terms?

2. In terms of the model parameters, what is the expected log(price) for a $75\ m^2$ apartment in _Kbh V_ with three rooms and one toilet?

3. What is the expected log(price) for an identical apartment in _Kbh K_?


#### Question 4.
How can you test for level differences in apartment prices across Copenhagen? How can you test whether the (implicit) price of an additional square meter differs across locations in Copenhagen? Formulate the null and alternative hypotheses (be precise!).


**Your answer:**


---

### Python exercises


#### Task 0: Warm-up

When using statsmodels for hypothesis testing in this problem set, it helps to know two Python features: **f-strings** and **list comprehensions**.



##### f-strings 
f-strings let you plug Python variables directly into your strings. Consider the example below:
```py
name = 'Daniel'
age = 30 + 1
greeting = f'My name is {name}, great to meet you! I am {age} years old'

print(greeting)
```

```txt
>> My name is Daniel, great to meet you! I am 31 years old
```

So by simply adding an 'f' in front of your strings, you get the superpower of being able to include the contents of variables directly in your strings.




##### List comprehension
List comprehension is another useful tool. It lets you quickly generate transformations of existing lists without writing an explicit for loop:
```py
numbers = [1, 2, 3, 4]
numbers2 = [2 * num for num in numbers] # <- list comprehension

print(numbers2)
```

```txt
>> [2, 4, 6, 8]
```
List comprehension can also be used to work with strings. And you can loop over multiple lists in the same list comprehension statement. Consider this example:

```py
letters = ['x', 'y', 'z']
numbers = [1, 2, 3, 4]

variables = [f'{let}_{num}' for let in letters for num in numbers]

print(variables)
```

```txt
>> ['x_1', 'x_2', 'x_3', 'x_4', 'y_1', 'y_2', 'y_3', 'y_4', 'z_1', 'z_2', 'z_3', 'z_4']
```

**Task:** Use your knowledge of list comprehension and f-strings to generate this output from the two lists in the code cell below:

```py
['C(location)[T.KBH N]:m2',
 'C(location)[T.KBH N]:rooms',
 'C(location)[T.KBH N]:toilets',
 'C(location)[T.KBH O]:m2',
 'C(location)[T.KBH O]:rooms',
 'C(location)[T.KBH O]:toilets',
 'C(location)[T.KBH V]:m2',
 'C(location)[T.KBH V]:rooms',
 'C(location)[T.KBH V]:toilets']
```


In [None]:
locs = ['KBH N', 'KBH O', 'KBH V']
vars = ['m2', 'rooms', 'toilets']
loc_vars = ... # Your list comprehension code here

#### Task 1.
**Load the data set into pandas** and provide a descriptive analysis of sales prices and apartment sizes across different locations in Copenhagen.

_Hint 1:_ Remember from Problem Set 1 that we can compute grouped summary statistics by using the `.groupby()` method in a DataFrame.

_Hint 2:_ If you don't want all the summary statistics, but just the mean, you can use `.mean()` instead of `.describe()`. This is especially useful when grouping on some category, as the resulting table has a tendency of becoming very big otherwise. Similar useful methods are `.std()`, `.count()`, `.max()`, `.min()` and `.median()`

_Hint 3:_ You can use `df['location'].value_counts()` to see the distribution of the observations across the four locations in the dataset.
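The three hints can be sketched on a tiny made-up DataFrame. The numbers below are invented for illustration and are not taken from `PS6.dta`; only the column names `location`, `price` and `m2` are borrowed from the variable list. For the real data you would first load the file, for example with `pd.read_stata('PS6.dta')`, assuming the file sits in your working directory.

```python
import pandas as pd

# Hypothetical toy data, NOT the PS6 sample; values are invented
toy = pd.DataFrame({
    'location': ['KBH K', 'KBH K', 'KBH N', 'KBH N', 'KBH V'],
    'price': [3_000_000, 2_500_000, 1_800_000, 2_200_000, 2_000_000],
    'm2': [90, 70, 60, 80, 75],
})

# Mean price and mean size within each location
group_means = toy.groupby('location')[['price', 'm2']].mean()
print(group_means)

# Number of observations in each location
counts = toy['location'].value_counts()
print(counts)
```

Selecting the columns with `[['price', 'm2']]` before calling `.mean()` keeps the table small and avoids averaging columns that are not of interest.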


**Your code:**

**Your answer:**


#### Task 2.
**Assume model (1) satisfies MLR.1–MLR.5.** Estimate model (1) by OLS and comment on the parameter estimates. How much of the variation in $\log(price)$ can the regression model explain? Is the sign of $\hat{\beta}_3$ surprising?


**Your code:**

**Your answer:**

---
#### --- INTERMISSION --- 
Before we move on with the next exercises, you should learn about a feature in `statsmodels` which makes it a little simpler to run regressions.

So far we have manually chosen our $X$-matrix, added a constant and chosen our $y$-vector. We can skip all these steps and instead specify the model using a text string. This is closer to how one would do the analysis in a software package such as Stata or R.

To achieve this, we are going to import a new module from statsmodels:

```py
import statsmodels.formula.api as smf
```

Now, try to solve Task 2 again using this module. Use the commands
```py
df['logprice'] = np.log(df.price)

results = smf.ols('logprice ~ m2 + rooms + toilets', data = df).fit()
print(results.summary())
```

As you can see from the output, statsmodels automatically adds a constant. If you want to add more explanatory variables to the specification, you just extend the string.

**Your code:**

---


#### Task 3.
- Estimate model (2)

- Interpret the regression results. What are the estimated price differences across neighbourhoods in percentages?

Make sure you understand both hints below before you solve the task.





[_Hint 1:_ Assuming your DataFrame is named `df`, this code may be of help when constructing the dummy variables:
 ```py
 df['KbhN'] = df.location == "KBH N"
 ```
 This creates a new column called `KbhN` that is filled with True or False values depending on whether each observation satisfies the condition. When including this variable in a statsmodels regression, it will automatically be interpreted as a dummy (True is 1, False is 0). 
 
 However, if you want to, you can convert the boolean column to 0/1 integers with this line of code:
 ```py
 df['KbhN'] = df['KbhN'].astype('int')
 ```
]


 
 [_Hint 2:_ Actually you don't have to generate the dummy variables manually.
 
 If you are using formulas to specify your model in statsmodels (as we learned above), you can skip the process described in the former hint entirely and simply add `C(location)` to your formula string to automatically add dummies based on the location categories to your regression model. Statsmodels also automatically leaves out one category to avoid the dummy trap.
 
  You can read more about this feature at https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
  ]
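To see what `C(...)` does in practice, here is a minimal sketch on simulated data. The sample, the coefficient values and the name `sim` are all assumptions made for illustration; nothing here comes from `PS6.dta`. The last lines show one common way of turning a log-point coefficient into an exact percentage difference, $100 \cdot (e^{\hat{\delta}} - 1)$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only (NOT the PS6 sample)
rng = np.random.default_rng(0)
n = 400
sim = pd.DataFrame({
    'location': rng.choice(['KBH K', 'KBH N', 'KBH O', 'KBH V'], size=n),
    'm2': rng.uniform(40, 140, size=n),
})
# Assumed data-generating process: KBH N is 0.20 log points below KBH K
sim['logprice'] = (14 + 0.01 * sim['m2']
                   - 0.20 * (sim['location'] == 'KBH N')
                   + rng.normal(0, 0.1, size=n))

res = smf.ols('logprice ~ C(location) + m2', data=sim).fit()

# One category is left out as the reference group
print(res.params.index.tolist())

# Exact percentage difference implied by the log-point coefficient
delta_n = res.params['C(location)[T.KBH N]']
pct_n = 100 * (np.exp(delta_n) - 1)
print(f'KBH N relative to the reference group: {pct_n:.1f} %')
```

With string categories, the reference group is the first category in sorted order, which is _KBH K_ here. The generated parameter names follow the pattern `C(location)[T.<category>]`.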


**Your code:**

**Your answer:**

#### Task 4.

Use an $F$-test to test whether there are level differences in apartment prices across neighbourhoods of Copenhagen. Be precise in formulating your null and alternative hypotheses.

- Why are we using an $F$-test?

- What assumptions are necessary for the validity of the test?

- Based on your test results, do you prefer model (1) or model (2)?



_Hint:_ You can calculate the F-test by hand or use the built-in `.f_test()` method of your statsmodels OLS results object. For example, if you want to test whether the coefficients on $m2$ and $rooms$ are both equal to zero, you would use the code

```py
ftest = results.f_test(['m2', 'rooms'])

print(ftest)
```
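The "by hand" route and the built-in route can be compared on simulated data. This is a sketch under stated assumptions: the variables `x1`, `x2`, `x3` and `y` are hypothetical, the errors are homoskedastic by construction, and the hand calculation uses the usual sum-of-squared-residuals form $F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$. The restriction is written as a single constraint string, which `.f_test()` also accepts.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only; x2 and x3 are irrelevant by construction
rng = np.random.default_rng(1)
n = 300
sim = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
sim['y'] = 1 + 0.5 * sim['x1'] + rng.normal(size=n)

unrestricted = smf.ols('y ~ x1 + x2 + x3', data=sim).fit()
restricted = smf.ols('y ~ x1', data=sim).fit()  # imposes x2 = x3 = 0

# F statistic by hand from the two sums of squared residuals
q = 2  # number of restrictions
F_manual = ((restricted.ssr - unrestricted.ssr) / q) / (
    unrestricted.ssr / unrestricted.df_resid)

# Built-in version on the unrestricted model
ftest = unrestricted.f_test('x2 = 0, x3 = 0')
F_builtin = float(np.squeeze(ftest.fvalue))

print(f'By hand:  {F_manual:.4f}')
print(f'Built-in: {F_builtin:.4f}')
```

The two numbers coincide because both use the default (non-robust) covariance estimate. `np.squeeze` is used because the shape of `fvalue` can differ between statsmodels versions.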

**Your answer:**


#### Task 5.
**Interaction terms for apartment size and location.** 
1. Estimate the full model (3) with all the interaction terms. Interpret the regression results. Are all the estimated coefficients individually significant?

2. Test whether there are interaction effects across locations in Copenhagen (test if all the interaction terms are jointly 0).

3. Test specifically whether the effect of the number of rooms (`rooms`) differs across locations.

4. Test specifically whether the price effect of apartment size (`m2`) differs across locations.

5. If you want to, try to estimate a new specification based on your test insights to better explain the data.

_Hint:_ If you use formulas to specify your regression model in statsmodels, you can interact two terms in the formula by using the `*` operator instead of `+`. 

For example, if I wanted to interact `m2` with `rooms`, I could use the code:
```py
results = smf.ols('logprice ~ toilets + m2 * rooms', data = df).fit()
```
You can also interact a variable with multiple variables by grouping them in parentheses. For example:
```py
results = smf.ols('logprice ~ m2*(rooms + toilets)', data = df).fit()
```

Note that when interacting two variables, statsmodels automatically adds the two variables individually (that is, un-interacted) to the model specification too.
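The expansion can be checked directly by comparing a `*` formula with its written-out version, where `:` denotes the interaction term alone. The data below are simulated purely for illustration and the coefficient values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only (NOT the PS6 sample)
rng = np.random.default_rng(4)
n = 200
sim = pd.DataFrame({
    'm2': rng.uniform(40, 140, size=n),
    'rooms': rng.integers(1, 6, size=n),
    'toilets': rng.integers(1, 3, size=n),
})
sim['logprice'] = 14 + 0.01 * sim['m2'] + rng.normal(0, 0.1, size=n)

# Short form: '*' adds both main effects and the interaction
res_star = smf.ols('logprice ~ toilets + m2 * rooms', data=sim).fit()

# Written-out form: main effects plus the ':' interaction term
res_long = smf.ols('logprice ~ toilets + m2 + rooms + m2:rooms', data=sim).fit()

print(res_star.params.index.tolist())
print(res_long.params.index.tolist())
```

Both fits contain the same five terms, so the interaction coefficient can be looked up under the name `m2:rooms` in either results object.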


_Hint:_ Use the output from the warmup exercise (on f-strings and list comprehension) to conduct the first F-test. 

**5.1 code:**

**5.1 answer:**

**5.2 code:**

**5.2 answer:**

**5.3 code:**

**5.3 answer:**

**5.4 code:**

**5.4 answer:**


#### Task 6.
**Quadratic model for apartment size.** 
Model (1) assumes that apartment prices depend linearly on apartment size, but this may be a restrictive assumption. 

You are therefore asked to estimate a new model including a quadratic term of $m2$. Moreover, the model should allow the effect of the linear and quadratic **m2** terms to be different across locations, while making the simplifying assumption that **rooms** and **toilets** have the same effect on sales prices across locations. 

- Generate the variable for squared m2 and estimate the new model
- Test the new model against a restricted model where all slope parameters are the same across locations in Copenhagen. 
- Which model do you prefer in this case?
- What is the expected effect of increasing $m2$ by one unit on prices?

_Hint:_ It might be helpful to scale the squared term by a factor of e.g. 1000 (how will this affect your estimates?)
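If you want to check your reasoning about the scaling numerically, a small simulation is one option. The sketch below uses invented data and plain least squares via `np.linalg.lstsq`; the coefficient values in the data-generating process are arbitrary assumptions.

```python
import numpy as np

# Simulated data for illustration only (NOT the PS6 sample)
rng = np.random.default_rng(2)
n = 200
m2 = rng.uniform(40, 140, size=n)
y = 13 + 0.02 * m2 - 0.00005 * m2**2 + rng.normal(0, 0.1, size=n)

# Design matrices with the raw and the rescaled squared term
X_raw = np.column_stack([np.ones(n), m2, m2**2])
X_scaled = np.column_stack([np.ones(n), m2, m2**2 / 1000])

# Least-squares coefficients for both versions
b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)

print('raw:   ', b_raw)
print('scaled:', b_scaled)

# Fitted values under each parameterisation
fit_raw = X_raw @ b_raw
fit_scaled = X_scaled @ b_scaled
print('max difference in fitted values:', np.abs(fit_raw - fit_scaled).max())
```

Compare the third coefficient across the two fits, and note what happens to the other coefficients and to the fitted values.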

**Your code:**

**Your answer:**